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Abstract 


Scalable  probabilistic  reasoning  is  the  key  to  unlocking  the  full  potential  of  the  age  of 
big  data.  From  untangling  the  biological  processes  that  govern  cancer  to  effectively  targeting 
products  and  advertisements,  probabilistic  reasoning  is  how  we  make  sense  of  noisy  data 
and  turn  information  into  understanding  and  action.  Unfortunately,  the  algorithms  and  tools 
for  sophisticated  structured  probabilistic  reasoning  were  developed  for  the  sequential  Von 
Neumann  architecture  and  have  therefore  been  unable  to  scale  with  big  data.  In  this  thesis 
we  propose  a  simple  set  of  design  principles  to  guide  the  development  of  new  parallel  and 
distributed  algorithms  and  systems  for  scalable  probabilistic  reasoning.  We  then  apply  these 
design  principles  to  develop  a  series  of  new  algorithms  for  inference  in  probabilistic  graphical 
models  and  derive  theoretical  tools  to  characterize  the  parallel  properties  of  statistical  inference. 
We  implement  and  assess  the  efficiency  and  scalability  of  the  new  inference  algorithms  in  the 
multicore  and  distributed  settings  demonstrating  the  substantial  gains  from  applying  the  thesis 
methodology  to  real-world  probabilistic  reasoning. 

Based  on  the  lessons  learned  in  statistical  inference  we  introduce  the  GraphLab  parallel 
abstraction  which  generalizes  the  thesis  methodology  and  enable  the  rapid  development  of 
new  efficient  and  scalable  parallel  and  distributed  algorithms  for  probabilistic  reasoning.  We 
demonstrate  how  the  GraphLab  abstraction  can  be  used  to  rapidly  develop  new  scalable 
algorithms  for  probabilistic  reasoning  and  assess  their  performance  on  real-world  problems  in 
both  the  multicore  and  distributed  settings.  Finally,  we  identify  a  unique  challenge  associated 
with  the  underlying  graphical  structure  in  a  wide  range  of  probabilistic  reasoning  tasks.  To 
address  this  challenge  we  introduce  PowerGraph  which  refines  the  GraphLab  abstraction  and 
achieves  orders  of  magnitude  improvements  in  performance  relative  to  existing  systems. 
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Chapter  1 


Introduction 


Probabilistic  reasoning  forms  the  foundation  of  modern  statistics  and  machine  learning.  From  estimating 
the  chance  of  rain  to  predicting  a  user’s  shopping  interests  we  rely  on  probabilistic  reasoning  to  quantify 
uncertainty,  model  noisy  data,  and  understand  complex  phenomena.  Probabilistic  graphical  models  are  one 
of  the  most  influential  and  widely  used  techniques  for  complex  probabilistic  reasoning.  By  providing  a 
compact  representation  of  complex  phenomena  as  well  as  a  set  algorithmic  and  analytical  tools,  probabilistic 
graphical  models  have  been  used  to  successfully  untangle  the  structure  [Yanover  and  Weiss,  2002,  Yanover 
et  al.,  2007]  and  interactions  [Jaimovich  et  al.,  2006]  of  proteins,  decipher  the  meaning  of  written  language 
[Blei  et  al.,  2003,  Richardson  and  Domingos,  2006],  and  label  our  visual  world  [Li,  1995]. 

Historically,  exponential  gains  in  sequential  processor  technology  have  enabled  increasingly  sophisticated 
probabilistic  graphical  models  to  be  applied  to  a  wide  range  of  challenging  real-world  problems.  However, 
recent  physical  and  economic  limitations  have  forced  computer  architectures  away  from  exponential  se¬ 
quential  performance  scaling  and  towards  parallel  performance  scaling  [Asanovic  et  al.,  2006].  Meanwhile, 
improvements  in  processor  virtualization,  networking,  and  data-storage,  and  the  desire  to  catalog  every 
fact  and  human  experience,  has  lead  to  the  development  of  new  hardware  paradigms  like  cloud  computing 
and  massively  parallel  data-centers  which  fundamentally  challenge  the  future  of  sequential  algorithms 
[Armbrust  et  al.,  2010]. 

Unfortunately,  the  algorithms  and  techniques  used  to  analyze  and  apply  probabilistic  graphical  models 
were  developed  and  studied  in  the  sequential  Von  Neumann  setting.  As  a  consequence,  recent  efforts 
[Chu  et  al.,  2006,  Panda  et  al.,  2009,  Wolfe  et  al.,  2008,  Ye  et  al.,  2009]  in  scalable  machine  learning  have 
overlooked  these  rich  models  and  techniques  in  favor  of  simpler  unstructured  models  and  approximations 
which  are  more  immediately  amenable  to  existing  large-scale  parallel  tools,  abstractions,  and  infrastructure. 
Alternatively,  when  confronted  with  computationally  challenging  graphical  models  we  typically  resort  to 
model  simplification  rather  than  leveraging  ubiquitous  and  increasingly  powerful  parallel  hardware. 

In  this  thesis  we  introduce  the  simple  GrAD  design  methodology  which  leads  to  a  new  set  of  efficient 
algorithms,  theoretical  tools,  and  systems  for  scalable  parallel  and  distributed  probabilistic  reasoning.  The 
Graphical,  Asynchronous,  Dynamic,  (GrAD),  design  methodology  is  composed  of  the  following  three 
basic  principles: 

1.  Graphical:  Explicitly  represent  statistical  and  computational  dependencies  as  a  sparse  graph.  The 
sparse  graphical  representation  enables  the  parallel  decomposition  of  algorithms  and  exposes  the 
data  locality  needed  for  efficient  distributed  computation. 
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2.  Asynchronous:  Allow  computation  to  proceed  asynchronously  as  dependencies  are  satisfied  obeying 
the  dependency  structure  encoded  in  the  graph.  By  allowing  computation  to  run  asynchronously 
as  dependencies  are  satisfied  we  can  reduce  wasted  computation  and  build  more  efficient  parallel 
and  distributed  systems.  By  obeying  the  dependency  structure  we  can  obtain  provably  correct 
asynchronous  algorithms. 

3.  Dynamic:  Adaptively  identify  and  prioritize  sub-problems  that  when  solved  make  the  most  progress 
towards  the  solution.  By  first  identifying  and  solving  the  key  sub-problems  we  can  reduce  the  overall 
work  and  more  effectively  use  computational  resources. 

The  GrAD  design  methodology  lead  to  the  following  core  principle  evaluated  in  this  work: 

Thesis  Statement:  By  factoring  large-scale  probabilistic  computation  with  respect  to  the  underlying 
graphical  structure,  asynchronously  addressing  sequential  dependencies,  and  dynamically  scheduling 
computation,  we  can  construct  efficient  parallel  and  distributed  systems  capable  of  scalable  probabilistic 
reasoning. 

We  evaluate  the  thesis  statement  by  applying  the  GrAD  design  methodology  to  two  of  the  most  widely  used 
probabilistic  inference  techniques,  loopy  belief  propagation  (Chapter  3)  and  Gibbs  sampling  (Chapter  4), 
and  derive  a  sequence  of  efficient  parallel  algorithms  for  probabilistic  reasoning.  In  the  process,  we  develop 
basic  theoretical  tools  like  the  re-approximation  Section  3.3  and  the  connection  between  graph  coloring  and 
Gibbs  sampling  Section  4.3  to  characterize  the  efficiency,  scalability,  and  correctness  of  parallel  inference 
algorithms.  Using  large-scale  real-world  problems  we  experimentally  evaluate  the  resulting  algorithms  in 
the  parallel  and  distributed  settings.  We  find  that  by  applying  the  GrAD  methodology  we  are  able  to  obtain 
provably  efficient  and  correct  parallel  and  distributed  algorithms  that  substantially  outperform  existing 
approaches. 

By  generalizing  our  work  on  probabilistic  inference  algorithms,  we  derive  the  GraphLab  high-level  graph- 
parallel  abstraction  (Chapter  5)  which  explicitly  targets  the  GrAD  design  principles  while  providing  strong 
serializability  guarantees  (Section  5.4.5)  that  simplify  the  design  and  implementation  GrAD  algorithms. 
We  use  the  GraphLab  abstraction  to  implement  our  graphical  model  inference  algorithms  as  well  as  several 
new  algorithms  demonstrating  the  applicability  of  the  GrAD  methodology  to  a  wider  class  of  problems 
(e.g.,  Co-EM  in  Section  5.8.3  and  the  Lasso  in  Section  5.6.4).  We  implement  the  GraphLab  system  in 
both  the  multicore  and  distributed  settings  generalizing  our  earlier  system  design  work  by  introducing  new 
dynamic  schedulers,  improved  distributed  graph  representations,  and  fault-tolerance.  Finally,  we  evaluate 
our  implementation  of  the  GraphLab  system  and  show  orders  of  magnitude  improvements  over  traditional 
data-parallel  abstractions  like  Hadoop. 

From  our  experience  in  designing  and  deploying  GraphLab,  we  identified  a  critical  challenge  in  scaling 
graphical  computation  to  massive  real-world  graphs.  The  power-law  degree  distribution,  a  defining  char¬ 
acteristic  of  a  vast  majority  of  real-world  graphs,  violates  several  of  the  key  assumptions  that  form  the 
foundation  of  the  GraphLab  system  as  well  as  other  widely  used  systems  for  graphical  computation.  To 
address  this  challenge,  we  introduce  PowerGraph  which  refines  the  GraphLab  abstraction  and  exploits  the 
power-law  structure  as  well  as  a  common  pattern  in  GrAD  algorithms.  As  a  consequence  PowerGraph  is 
able  to  substantially  reduce  communication  overhead  and  introduce  a  new  approach  to  distributed  graph  rep¬ 
resentation.  We  demonstrate  that  the  PowerGraph  system  both  theoretically  and  experimentally  outperforms 
existing  graph  computation  systems  (including  GraphLab)  often  by  orders-of-magnitude. 
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1.1  Example:  Modeling  Political  Bias  in  a  Social  Network 


Suppose  we  are  interested  in  estimating  the  political  bias  (e.g.,  liberal  or  conservative)  of  each  user  in 
a  large  social  network  with  hundreds  of  millions  of  users  and  billions  of  relationships.  We  might  begin 
by  building  a  simple  classifier  that  predicts  political  bias  of  a  user  based  on  the  presence  of  politically 
polarizing  text  in  her  profile.  Applying  this  simple  classifier  to  each  user  in  isolation  is  a  well  studied 
problem  [Chu  et  al.,  2006]  and  can  easily  scale  to  hundreds  of  millions  of  users  by  applying  classic  parallel 
computing  techniques  and  abstractions  [Dean  and  Ghemawat,  2004].  While  we  may  be  able  to  accurately 
estimate  the  political  bias  of  users  that  frequently  post  polarizing  comments,  for  the  vast  majority  of  users 
who  post  infrequently  the  profile  may  not  contain  sufficient  information  to  draw  strong  conclusions  about 
political  bias  (see  Figure  1.1a). 


;P 

;P 


(a)  Without  Social  Network 


(b)  Predictions  With  Social  Network 


Figure  1.1:  Social  Network  Example:  Consider  the  task  estimating  the  political  bias  of  users  in  a  large 
social  network.  In  (a)  we  have  politically  informative  comments  for  only  a  small  subset  of  the  users  leaving 
us  unable  to  make  predictions  about  the  vast  majority.  However  by  including  the  social  network  structure 
(b)  we  can  substantially  improve  our  predictions. 


Graphical:  If  we  rely  exclusively  on  the  profile  text  to  estimate  political  bias,  we  are  discarding  the  import 
relationships  between  users.  Moreover,  knowing  something  about  the  political  bias  of  your  friends,  their 
friends,  and  your  greater  social  community  can  reveal  a  lot  about  your  political  bias  (see  Figure  1.1b). 
Unfortunately,  directly  modeling  all  of  these  relationship  leads  to  overly  parameterized  models  and  is 
generally  computation  intractable.  Instead,  we  can  decompose  the  relationship  along  the  original  social 
graph.  However,  to  be  able  to  model  the  long-range  relationships  (e.g.,  the  effect  of  your  community  on 
your  political  bias)  we  will  need  to  iteratively  move  information  through  the  graph  (transitive  closure).  This 
can  be  accomplished  by  using  inference  algorithms  that  factorize  over  the  graph  structure  (i.e.,  operate  only 
at  the  level  of  each  vertex  and  its  neighbors)  and  iteratively  refine  simple  models  at  each  vertex  based  on 
the  models  on  neighboring  vertices. 

Asynchronous:  Having  decomposed  the  task  of  modeling  long  range  relationships  into  the  task  of 
iteratively  refining  local  relationships,  we  introduced  a  substantial  amount  of  parallelism.  For  example 
in  the  large  social  network  we  could  in  principle  refine  the  model  for  each  user  simultaneously.  As 
a  consequence,  many  have  proposed  synchronously  refining  all  users  models  by  simulating  a  separate 
processor  for  each  user.  However,  we  will  show  that  it  is  substantially  more  efficient  to  refine  users 
profiles  asynchronously  as  processor  resources  become  available  and  using  the  most  recent  estimates  for 
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neighboring  profiles.  Intuitively,  if  a  processor  is  currently  refining  one  of  your  neighbors  models  then  it  is 
often  better  to  wait  for  your  neighbors  new  model  then  to  recompute  based  on  the  old  model. 

Dynamic:  In  a  large  social  network  there  are  often  trend  setters  whose  opinion  shapes  their  community. 
Intuitively,  applying  computational  resources  to  refine  the  local  models  for  the  trend  setters  can  have  a 
greater  impact  than  applying  the  same  resources  to  the  vast  majority  of  the  remaining  users.  Alternatively, 
accurately  estimating  the  models  for  the  vast  majority  based  on  inaccurate  models  for  the  trend  setters 
wastes  resources  since  the  estimates  for  the  vast  majority  are  likely  to  change  as  we  refine  the  estimates  for 
the  trend  setters.  By  dynamically  identifying  and  focusing  computational  resources  on  the  trend  setters  we 
can  eliminate  wasted  computation  and  accelerate  convergence. 

GraphLab:  In  order  to  implement  and  run  the  proposed  Graphical  Asynchronous  Dynamic  (GrAD) 
algorithm  we  would  need  to  address  a  wide  range  of  complex  multicore  and  distributed  system  design 
challenges  including:  efficient  re-entrant  data  structure  design,  race  and  deadlock  elimination,  distributed 
data  placement  and  coordination,  and  fault-tolerance.  While  there  are  a  several  high-level  parallel  and 
distributed  abstractions  which  effectively  hide  the  challenges  of  system  design,  none  of  them  naturally 
express  GrAD  algorithms.  To  fill  this  void  we  introduced  the  GraphLab  abstraction  which  directly  targets 
GrAD  algorithms  and  hides  all  of  the  challenges  of  multicore  and  distributed  system  design.  Using  the 
GraphLab  abstraction,  we  can  quickly  and  easily  implement  and  debug  our  GrAD  algorithm.  Then  using 
the  many  interchangeable  GraphLab  schedulers  we  can  evaluate  the  effect  of  different  dynamic  schedules 
on  convergence  and  scalability.  Finally,  we  can  apply  our  algorithm  to  large  graphs  in  both  the  multicore 
and  distributed  settings. 

PowerGraph:  However,  if  we  try  to  scale  GraphLab  to  the  full  social  network  with  billions  of  edges  using 
a  large  cluster  of  machines,  we  quickly  find  that  communication  overhead  dominates  and  performance 
degrades.  Furthermore  if  we  try  to  use  other  abstractions  or  traditional  techniques  for  implementing 
distributed  graph  computation  the  situation  does  not  improve.  The  problem  is  that  the  power-law  sparsity 
structure,  commonly  found  in  graphs  derived  from  natural  phenomena  (e.g.,  social  networks  and  the  web), 
violates  many  of  the  assumptions  made  by  traditional  graph-parallel  systems  (including  the  GraphLab 
system).  However,  by  exploiting  a  common  pattern  in  GrAD  algorithms  and  the  structure  of  power-law 
graphs  we  can  refine  the  GraphLab  abstraction  and  produce  a  new  system  PowerGraph  that  is  orders  of 
magnitude  faster  and  more  efficient. 


1.2  Contributions  and  Greater  Impact 


We  evaluate  the  thesis  statement  by  developing  the  core  algorithms,  theory,  and  tools  needed  to  scale 
probabilistic  reasoning  to  challenging  real-world  problems  in  the  parallel  and  distributed  settings.  More 
precisely,  we: 

•  design  and  implement  a  suite  of  efficient  parallel  and  distributed  graphical  model  inference  algorithms 
based  on  the  GrAD  principles.  Chapter  3,  Chapter  4) 

•  develop  theoretical  tools  to  characterize  the  efficiency  and  scalability  of  the  proposed  algorithms  as 
well  as  the  underlying  parallel  structure  of  inference  in  graphical  models.  (Chapter  3,  Chapter  4) 

•  introduce  the  GraphLab  abstraction  and  system  which  simplifies  the  design  and  implementation  of 
scalable  GrAD  parallel  and  distributed  algorithms.  (Chapter  5) 
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•  introduce  the  PowerGraph  abstraction  and  system  which  refines  GraphLab  to  efficiently  express 
computation  on  power-law  graphs  and  capture  the  fundamental  structure  of  the  GrAD  methodology. 
(Chapter  6) 

•  evaluate  the  proposed  techniques  on  real-world  problems  using  a  variety  of  parallel  platforms  and 
demonstrate  substantially  improved  scaling,  computational  efficiency,  accuracy,  and  speed  relative  to 
existing  systems  and  techniques. 

The  algorithms,  systems,  and  abstractions  developed  throughout  this  thesis  have  already  started  to  have 
significant  impact  on  the  greater  research  community.  The  GraphLab  and  PowerGraph  systems  along  with 
open-source  implementations  of  a  wide  variety  of  GrAD  algorithms  for  scalable  probabilistic  reasoning 
have  been  widely  downloaded  and  used  to  address  tasks  ranging  from  collaborative  filtering  to  natural 
language  modeling.  Along  with  help  from  collaborators  we  have  fostered  a  community  around  graphical 
approaches  to  scalable  probabilistic  reasoning. 

In  the  future,  we  believe  the  emphasis  on  graph  structured  models  and  systems  for  scalable  efficient 
inference,  will  enable  the  development  and  application  of  substantially  richer  models  to  the  massive  scale 
real-world  problems.  Throughout  this  thesis  we  give  examples  of  sophisticated  structured  models  and 
how  they  are  able  to  address  complex  real-world  problems.  By  efficiently  using  parallel  and  distributed 
resources,  the  algorithms  and  techniques  described  in  this  thesis  can  substantially  decrease  the  time  required 
to  design,  evaluate,  and  debug  new  sophisticated  models  on  the  complete  data  allowing  model  designers  to 
capture  the  full  complexity  of  the  underlying  phenomenon.  We  believe  this  will  lead  to  a  proliferation  of 
new  structured  models  and  inference  techniques  in  a  wide  range  of  applications. 

The  GraphLab  and  PowerGraph  high-level  abstractions  will  allow  advances  in  scalable  system  design  to 
be  directly  applicable  to  algorithms  built  upon  these  abstractions.  By  abstracting  away  the  complexity 
of  scalable  probabilistic  reasoning,  the  GraphLab  and  PowerGraph  abstractions  enable  system  designers 
to  develop  and  apply  advances  in  eventually  consistent  data  representations,  parallel  and  distributed  data 
structure  design,  and  networking  to  a  wide  range  of  probabilistic  reasoning  tasks.  Similarly,  by  isolating  the 
challenges  of  scalable  parallel  and  distributed  system  design,  GraphLab  and  PowerGraph  system  enables 
experts  in  probabilistic  reasoning  to  easily  and  efficiently  construct  rich  models  and  algorithms  which 
leverage  a  wide  range  of  architectures. 

Finally,  we  believe  the  GrAD  design  methodology  and  the  lessons  learned  in  this  thesis  will  help  refocus 
the  course  of  future  research  in  scalable  probabilistic  reasoning  away  from  simple  unstructured  models 
and  parallelism  at  the  expense  algorithm  efficiency  and  towards  rich  structured  models  and  efficient  GrAD 
algorithms  and  systems.  This  thesis  provides  a  tractable  approach  to  rich  structured  models  with  scalable 
and  efficient  distributed  algorithms  and  systems  and  opens  a  wide  range  of  new  research  directions  including, 
distributed  dynamic  scheduling,  the  role  of  model  structure  in  approximate  probabilistic  reasoning,  scalable 
asynchronous  consistency  protocols,  efficient  distributed  placement  and  representation  of  natural  graphs, 
and  new  approaches  to  recursive  model  design. 


1.3  Outline  and  Summary  of  Key  Results 


We  begin  in  Chapter  2  by  briefly  reviewing  some  of  the  key  terminology,  operations,  and  properties  of 
probabilistic  graphical  models  as  well  as  parallel  and  distributed  algorithms  and  systems. 

In  Chapter  3  and  we  apply  the  GrAD  methodology  to  design  and  implement  algorithms  and  systems  for 
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(a)  Contributions  in  Belief  Propagation 
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(b)  Contributions  in  Gibbs  Sampling 
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(c)  Contributions  in  High-Level  Abstractions  and  Systems 


Figure  1.2:  The  contributions  of  the  thesis  work  in  the  context  of  existing  work  in  graphical  models  inference  (belief 
propagation  and  Gibbs  sampling)  as  well  high-level  abstractions  and  systems  for  parallel  and  distributed  computation. 
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(a)  Execution  of  the  Splash  Algorithm  on  the  Image  Denoising  Task 


(b)  Speedup  (c)  Work  (d)  Speedup 

Figure  1 .3:  Key  Results  from  Work  on  Belief  Propagation:  In  (a)  we  plot  the  number  of  times  each  pixel  is  recomputed 
when  applying  the  SplashBP  algorithm  to  the  image  denoising  task.  Brighter  pixels  have  been  recomputed  more  often.  As  the 
algorithm  converges  it  adaptively  identifies  the  challenging  parts  of  the  image  and  focuses  computation  where  it  is  needed.  (b,c) 
By  introducing  more  aspects  of  the  GrAD  we  are  able  to  improve  the  scalability  and  efficiency  of  probabilistic  reasoning  systems. 
The  Wildfire,  Residual,  and  Splash  algorithms  implement  all  three  of  the  GrAD  principles.  By  closely  preserving  the  optimal 
sequential  schedule  the  SplashBP  algorithm  is  able  to  most  efficiently  schedule  computation,  (d)  To  deal  with  work  imbalance  in 
the  distributed  setting  we  introduced  a  simple  over  partitioning  heuristic  (larger  k)  which  can  substantially  improve  scalability. 


probabilistic  inference  using  loopy  belief  propagation.  We  incrementally  introduce  each  aspect  of  the 
GrAD  approach  and  derive  a  sequence  of  algorithms  which  build  upon  each  other  (see  Figure  1.3).  In  the 
process  we  develop  basic  theoretical  tools  for  reasoning  about  the  efficiency  and  convergence  of  parallel 
belief  propagation  algorithms  and  relate  the  parallel  scaling  to  both  structural  properties  of  the  graph  as 
well  as  underlying  model  parameters.  In  the  end  we  arrive  at  the  SplashBP  algorithm  which  adaptively 
moves  wavefronts  across  the  graph  and  embodies  the  key  ideas  described  in  the  thesis  statement.  We  then 
implement  the  each  algorithms  in  the  multicore  and  distributed  settings  and  evaluate  their  performance  on  a 
wide  range  of  metrics  using  real-world  problems. 

In  Chapter  4  we  apply  the  GrAD  methodology  to  Gibbs  sampling,  a  widely  used  inference  technique 
in  Bayesian  statistics.  As  with  our  work  on  belief  propagation  we  derive  a  sequence  of  algorithms  each 
introducing  new  properties  of  the  GrAD  methodology.  In  contrast  to  existing  work  on  parallel  Gibbs 
sampling,  we  prove  that  the  resulting  parallel  algorithms  are  ergodic  and  therefore  preserve  the  same 
guarantees  as  their  sequential  counterparts.  In  addition  we  relate  the  parallel  scaling  to  graph  theoretic 
properties  of  the  underlying  Markov  Random  Field  and  reveal  the  limiting  distribution  of  existing  non- 
ergodic  parallel  Gibbs  samplers.  Finally,  we  arrive  at  the  SplashGibbs  algorithm  which  applies  many  of  the 
lessons  learned  from  our  work  on  SplashBP  to  substantially  accelerate  burn-in  and  mixing  (see  Figure  1 .4. 
We  then  implement  each  of  the  algorithms  in  the  multicore  setting  and  evaluate  their  performance  on  a 
collection  of  synthetic  and  real-world  sampling  tasks. 

In  Chapter  5  we  introduce  the  GraphLab  graph-parallel  abstraction  which  generalizes  our  work  in  graphical 
model  inference  and  is  specifically  targeted  at  algorithms  that  follow  the  GrAD  methodology.  We  begin  by 
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(a)  Noisy  (b)  FIFO  Splashes  (c)  FIFO  Splashes  (+)  (d)  Priority  Splashes  (e)  Priority  Splashes  (+) 


u0  20  40  60  80  100 
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(f)  CORA-1  Likelihood 


(g)  CORA-1  Variance  (h)  CORA-1  Samples  Count 


(i)  CORA-1  Speedup 


Figure  1.4:  Key  Results  from  Work  on  Gibbs  Sampling:  (a)  A  noisy  image  of  a  sunset  illustration,  (b  -  e)  Different  thin 
junction  tree  Splashes  constructed  for  an  image  denoising  task  (grid  MRF).  Each  splash  is  shown  in  a  different  shade  of  gray  and 
the  black  pixels  are  not  assigned  to  any  Splash.  In  (c)  and  (e)  we  zoom  in  on  the  Splashes  to  illustrate  their  structure  and  the  black 
pixels  along  the  boundary  needed  to  maintain  conditional  independence,  (f-i)  Comparison  of  Chromatic  sampler  and  the  Splash 
sampler  at  different  settings  on  a  real  world  citation  modeling  task,  (f)  The  un-normalized  log-likelihood  plotted  as  a  function  of 
running-time,  (g)  The  variance  in  the  estimator  of  the  expected  assignment  computed  across  10  independent  chains  with  random 
starting  points,  (h)  The  total  number  of  variables  sampled  in  a  20  second  window  plotted  as  a  function  of  the  number  of  cores,  (i) 
The  speedup  in  number  of  samples  drawn  as  a  function  of  the  number  of  processors. 
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(c)  Serializability  and  Parallelism 
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(f)  Other  Alg.  Speedup  (g)  EC2  Price/Performance 


Figure  1.5:  Key  Results  for  GraphLab:  Computation  in  GraphLab  is  encoded  as  a  vertex  programs  that  can  read  and  modify 
the  data  on  adjacent  vertices  and  edges.  In  (a)  the  vertex-program  running  on  the  vertex  1  has  access  to  all  the  data  in  the  shaded 
region.  GraphLab  automatically  enforces  strong  serializability  guarantees  through  three  exclusion  policies  (b)  which  allow  the  user 
to  explore  the  tradeoff  (c)  between  parallelism  and  data  consistency.  In  (d,e)  we  plot  the  multicore  speedup  of  our  earlier  parallel 
belief  propagation  and  Gibbs  sampling  algorithms  running  within  the  GraphLab  framework.  In  (f)  we  plot  the  distributed  speedup 
of  several  new  GrAD  algorithms  implemented  using  the  GraphLab  abstraction.  Finally,  in  (g)  we  demonstrate  the  substantial 
financial  savings  and  performance  gains  of  the  GraphLab  abstraction  when  compared  to  Hadoop. 


reviewing  the  key  properties  of  algorithms  that  follow  the  GrAD  methodology  and  discussing  how  existing 
high-level  abstractions  are  not  well  suited  for  these  algorithms.  We  then  present  the  GraphLab  abstraction 
and  demonstrate  how  GraphLab  is  able  to  naturally  represent  GrAD  algorithms  while  providing  strong 
serializability  guarantees.  We  then  implement  and  evaluate  the  GraphLab  system  in  the  multicore  and 
distributed  settings  demonstrating  strong  scalability  and  substantial  gains  relative  to  MapReduce  based 
abstractions  (i.e.,  Hadoop). 

In  Chapter  6,  we  identify  the  power-law  sparsity  structure  found  in  most  real-world  graphs  (Figure  1.6) 
as  one  of  the  critical  challenges  to  the  scalability  of  distributed  graph-parallel  computation.  We  describe 
how  the  high-degree  vertices  and  low  quality  balanced  cuts  found  in  power-law  graphs  leads  to  substantial 
storage  and  communication  overhead  for  GraphLab  as  well  as  other  popular  message  based  graph-parallel 
abstractions.  We  then  propose  the  GAS  decomposition  which  factors  vertex  programs  over  edges  in 
the  graph  and  enables  the  use  of  vertex  cuts  to  efficiently  place  large  power-law  graphs  in  a  distributed 
environment.  We  then  propose  a  set  of  three  simple  distributed  techniques  to  construct  vertex  cuts  as 
the  graph  is  loaded  and  theoretically  analyze  these  techniques  demonstrating  order  of  magnitude  gains 
over  standard  randomized  edge  cuts.  Finally,  we  evaluate  the  PowerGraph  system  on  several  real-world 
problems  and  demonstrate  order  of  magnitude  gains  against  existing  state-of-the-art  results. 
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(a)  Twitter  In-Degree  (b)  Graph  Decomposition  and  Comm. 
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(d)  Effect  of  Partitioning 


(e)  Power-law  Fan-Out  Comm,  (f)  Power-law  Fan-Out  Runtime  (g)  Triangle  Count  Runtime 


Figure  1 .6:  Key  Results  for  PowerGraph:  The  vast  majority  of  real-world  graphs  have  a  (a)  power-law  degree  distribution. 
Here  we  have  plotted  the  in-degree  distribution  for  the  twitter  follower  graph.  The  power-law  degree  distribution  leads  to  a 
small  set  of  very  high  degree  vertices.  The  PowerGraph  abstraction  (b)  splits  high  degree  vertices  across  machines  and  factor 
vertex-programs  along  edges  in  the  graph  using  the  Gather-Apply-Scatter  (GAS)  decomposition.  The  PowerGraph  abstraction 
enables  the  use  of  vertex-cuts  which  substantially  reduce  the  communication  overhead  (c)  relative  to  traditional  edge-cuts.  We 
introduced  several  new  partitioning  heuristics  (d)  which  improve  the  overall  performance  algorithms  run  on  the  PowerGraph 
system.  In  figures  (c,d)  we  demonstrate  that  the  PowerGraph  abstraction  out  performs  GraphLab  and  Pregel  and  is  robust  to 
changes  in  the  power-law  constant  (smaller  a  implies  more  high  degree  vertices).  Finally,  in  (g)  we  plot  the  runtime  in  minutes  of 
PowerGraph  and  Hadoop  when  computing  triangles  in  the  twitter  follower  graph.  PowerGraph  is  over  282  times  faster  than  the 
best  published  result  for  this  problem. 
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Chapter  2 


Background 


This  work  combines  ideas  from  statistical  machine  learning,  parallel  algorithms,  and  distributed  systems. 
In  this  chapter  we  provide  the  background  on  the  relevant  concepts  from  each  of  these  fields  and  introduce 
some  the  notation  used  throughout  the  remaining  chapters. 


2.1  Probabilistic  Reasoning 

Suppose  we  are  interested  in  modeling  the  political  bias  (e.g.,  liberal  or  conservative)  of  users  in  a  large 
social  network.  We  might  begin  by  representing  the  political  bias  of  each  user  i  as  a  Bernoulli  random 
variable  ~  Bern($i)  taking  the  value  0  if  i  is  a  liberal  and  1  if  i  is  a  conservative.  The  parameter  9\ 
determines  the  probability  P  (X,t  =  1)  =  6\  that  any  user  is  conservative  and  can  be  estimated  by  polling 
or  expert  knowledge  (e.g.,  0\  =  0.5).  While  this  simple  model  provides  a  prediction  of  the  political  bias  for 
each  user,  it  does  not  take  into  account  information  unique  to  each  user  (e.g.,  the  users  profile  or  purchase 
history). 

We  can  improve  our  model  by  incorporating  the  observed  profile  text  r/j  for  each  user.  This  can  be 
accomplished  by  defining  the  conditional  probability  P  (Yt  \  Xr )  of  observing  the  text  Y,  given  the 
user  has  the  political  bias  X, .  Using  expert  knowledge  or  polling  data  for  our  users  we  can  estimate 

P  (Yi  |  Xt) 


P  (Yi  =  y.i  |  Xi  =  Xi )  = 


Xi  =  0  (liberal) 

Xi  =  1  (conservative) 

’’Obamacare”  (ji  yi 

0.8 

0.3 

’’Obamacare”  e  yt 

0.2 

0.7 

(2.1) 


Using  Bayes  rule  we  can  combine  the  prior  P  ( X., )  with  the  likelihood  P  (Yt  \  Xi)  to  compute  the 
posterior  probability  that  a  particular  user  is  a  conservative  or  liberal  given  their  profile  text: 


P  (Xi\Yi)  oc 


P  (Yj  |  Xj)  P  (Xj) 
J2Xi  p  iYi  I  Xt)  P  (Xi) 


(2.2) 


While  we  may  be  able  to  accurately  estimate  the  political  bias  of  users  that  frequently  post  polarizing 
comments,  for  many  users  who  post  infrequently  the  profile  may  not  contain  sufficient  information  to  draw 
strong  conclusions  about  political  bias. 


11 


If  we  rely  exclusively  on  the  profile  text  to  estimate  political  bias,  we  are  discarding  the  import  relationships 
between  users.  Moreover,  our  model  assume  that  knowing  the  political  bias  of  a  user’s  friends  tells  us 
nothing  about  the  users  own  political  bias,  which  is  clearly  not  the  case.  More  precisely,  our  model 
incorrectly  assumes  that  the  random  variables  corresponding  to  political  bias  are  independent.  However 
relaxing  the  independence  assumption  in  the  context  of  large  datasets  is  both  challenging  and  one  of  the 
principle  goals  of  this  thesis  work. 

We  can  relax  the  independence  assumption  by  defining  the  full  joint  probability  over  all  the  random 
variables  in  our  model.  In  the  context  of  the  social  networking  example,  if  we  just  considered  the  random 
variables  X,  the  joint  probability  takes  the  form: 

P(Xll...,Xn)  (2.3) 


which  returns  a  probability  for  every  possible  assignment  to  each  random  variable  X,  in  the  model.  For  a 
social  network  with  only  two  people  we  could  encode  the  joint  probability  as  a  simple  table: 


P(X1,X2) 


p  (xi,x2) 

X\  =  0,  x2  =  0 

.4 

X\  =  0,X2  =  1 

.1 

Xi  =  1,X2  =  0 

.1 

Xl  =  l,x2  =  1 

.4 

(2.4) 


which  effectively  states  that  friends  are  more  likely  to  share  the  same  political  view.  Using  the  joint 
probability  function  we  can  answer  questions  about  the  relationship  between  random  variables.  For 
example,  in  the  above  two  user  social  network  the  probability  that  user  one  is  a  liberal  given  that  user  two 
is  a  liberal  can  be  expressed  as: 


P  (X1  =  0  I  X2  =  0) 


P(X1  =  0,X2  =  0) 

£BlP(*i  =  0,X2  =  0) 


0.4  _  4 

0.4 +  0.1  ~~  5 


(2.5) 


However,  to  extend  this  approach  to  even  a  small  social  network  of  100  users  we  would  need  a  table  with 
2100  rows  and  the  calculation  in  Eq.  (2.5)  would  requiring  a  summation  over  299  states! 


2.1.1  Factorized  Joint  Probability  Distributions 

Directly  learning  and  manipulating  the  full  joint  probability  distribution  over  every  possible  states  for  all 
random  variables  is  typically  intractable.  However,  for  most  problems  we  can  factorize  the  joint  probability 
distribution: 

V(Xl,...,Xn\e)  =  ^—  n/A(xA|0),  (2.6) 

into  a  products  of  functions  f\  called  factors.  Each  factor  f\  is  positive  function,  /a  :  xa  — >  M+,  over  a 
small  subset  A  C  {1, . . . ,  n}  of  random  variables  Xa-  The  factors  are  parametrized  by  a  set  of  parameters 
6,  and  the  partition  function  Z(9)  is  defined  as: 

^)  =  E--'E  n  /a(xa|0).  (2.7) 

Xt  xn  Ae-F 

While  we  will  primarily  focus  on  distributions  over  discrete  random  variables  Xt  <E  ft)  =  { 1 , . . . ,  |  A)  [  } 
most  of  the  proposed  methods  may  also  be  applied  to  continuous  distributions. 
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If  we  return  to  our  social  networking  example,  one  possible  factorized  representation  of  the  joint  probability 
of  all  the  users  and  their  profiles  is: 


P  (Xu . .  .,Xn,Yu  . . .  ,Yn)  cx  m  P  (Yi  |  Xi)  n  PPG 


(2.8) 


Vi=l 


*0=1 


This  factorized  representation  is  equivalent  to  the  independent  user  model  we  proposed  earlier.  Another 
possible  representation  for  the  joint  probability  distribution  is: 


P  (X1,...,Xn,Yu...,Yn)  cx  (np(^))  (  n  (Xi,Xj  \0)  )  . 

dhj)£E 


(2.9) 


u=i 


where  E  is  the  set  of  all  edges  in  the  social  network  and  the  the  factors  fuj} : 

={J; 


(2.10) 


(where  0  <  9  <  1)  are  defined  over  friends  i  and  j  and  encode  the  prior  belief  that  friends  should  share 
a  similar  political  bias.  Given  this  model  we  can  now  condition  on  the  observed  user  text  and  answer 
questions  about  each  user’s  political  bias  in  the  context  of  their  friends  and  their  greater  community. 

In  this  example  we  started  by  modeling  users  independently  and  then  improved  our  model  by  capturing 
the  relationships  between  friends.  In  the  process  we  described  a  compact  way  to  express  a  complex 
joint  distribution.  We  now  formalize  this  modeling  technique  and  describe  the  methods  for  applying  and 
analyzing  these  models. 


2.1.2  Probabilistic  Graphical  Models 

Probabilistic  graphical  models  provide  a  common  language  and  set  of  algorithmic  tools  for  efficiently 
representing  and  manipulating  large  factorized  distributions.  This  is  accomplished  by  constructing  a 
correspondence  between  the  factorized  distribution  and  a  combinatorial  graph  representation  upon  which 
we  can  define  algorithms  for  learning  and  manipulating  the  underlying  model. 

A  Markov  Random  Field  (MRF)  is  an  undirected  graphical  representation  of  a  factorized  distribution. 
The  MRF  corresponding  to  the  factorized  probability  distribution  in  Eq.  (2.6)  is  an  undirected  graph  over 
the  variables  where  X+  is  connected  to  Xj  if  there  is  a  A  G  T  such  that  i,j  6  A.  We  denote  the  set  of 
indicies  corresponding  to  the  variables  neighboring  X,  as  C  { 1 , . . . ,  n}  and  therefore  X\r  is  the  set  of 
neighboring  variables. 

To  provide  a  easy  to  visualize  illustration  of  a  Markov  Random  Field  we  introduce  the  digital  image 
denoising  task.  We  can  construct  a  factorized  distribution  to  model  the  noisy  pixels  in  a  digital  image  (see 
Figure  2.1).  For  each  pixel  sensor  reading  we  define  an  observed  random  variable  y,  and  a  corresponding 
unobserved  random  variable  Xt  which  represents  the  true  pixel  value.  We  can  then  introduce  set  of  factors 
corresponding  to  the  noise  process  (e.g.,  Guassian): 


f{i}(Xi  =  Xi\Yi=  yi ,  0)  =  exp 


(2.11) 
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(a)  Original  Image  (b)  Noisy  Image  (c )  MRF  (d)  Markov  Blanket 


Figure  2.1:  Image  denoising  problem,  (a)  The  synthetic  noiseless  original  image,  (b)  The  noisy  pixel 
values  observed  by  the  camera,  (c)  The  MRF  corresponding  to  the  potentials  in  Eq.  (2.11)  and  Eq.  (2.12). 
The  large  gray  vertices  correspond  to  the  latent  unobserved  pixel  values  Xj  while  the  dark  black  circles 
correspond  to  the  observed  pixel  values  y,  (d)  The  Markov  blanket  for  variable  .  (e)  The  denoise  factor 
graph.  As  with  the  MRF  we  have  removed  the  Y  random  variables  and  instead  associated  each  X,  with  the 
unary  factor  /*.  (f)  The  expected  assignment  to  each  of  the  latent  pixels. 


To  encode  smoothness  in  the  image  we  introduce  factors  which  encode  similarity  between  neighboring 
pixels  (e.g.,  Laplace  smoothing): 

f{i,j}(Xi  =  Xi,  Xj  =  Xj  |  9 )  =  exp  (-6*2  \xi  -  xj\) .  (2.12) 

for  adjacent  Xr  and  Xj  (e.g.,  cardinal  neighbors).  Then  we  can  express  the  factorized  joint  distribution 
(Eq.  (2.6))  as: 

P(Xu...,Xn,  Yi, . . . ,  Yn)  oc  (  f[  /W(X?:  =Xi\Yi  =  yi,9)j  f  f{yj}{Xt  =  xh  Xj  =  Xj  \  9) 

Vi=i  /  \(ij)eE 

(2.13) 

The  Markov  random  field  for  the  image  denoising  problem  is  drawn  in  Figure  2.1c,  and  the  indicies  of  the 
variables  neighboring  X5  in  Figure  2.1c  arc  A/5  =  {2, 4,  6,  8}.  To  resolve  the  underlying  image  w  e  might 
then  want  to  compute  the  expected  assignment  E  [Xj]  for  each  pixel  as  in  Figure  2.  If. 

The  edges  in  the  MRF  encode  both  the  statistical  dependencies  between  the  variables  and  the  computational 
dependencies  of  the  algorithms  that  operate  on  these  models.  As  a  consequence  the  MRF  is  a  crucial 
structure  in  designing  scalable  parallel  and  distributed  algorithms  for  structured  probabilistic  reasoning. 
In  addition,  the  MRF  also  provides  a  means  to  leverage  the  substantial  existing  work  in  parallel  graph 
algorithms  and  job  scheduling. 

When  building  large-scale  probabilistic  graphical  models,  sparsity  in  the  MRF  is  a  desirable  property. 
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Densely  connected  models  typically  require  too  many  parameters  and  make  exact  and  even  approximate 
inference  computationally  intractable.  Auspiciously,  sparsity  is  also  a  desirable  property  for  parallelism. 
Many  algorithms  that  operate  on  graphical  models  factor  with  respect  to  the  Markov  Blanket  of  each 
variable.  The  Markov  Blanket  (see  Figure  2.  Id)  of  a  variable  is  the  set  of  neighboring  variables  in  the 
MRF.  Therefore  sparse  models,  provide  strong  locality  in  computation  which  is  a  very  desirable  property  in 
parallel  algorithms.  In  fact  the  new  parallel  abstraction  we  develop  in  Chapter  5  is  built  around  the  locality 
derived  from  sparsity  in  the  MRF  and  the  idea  that  computation  factorizes  over  Markov  blankets. 


Factor  Graphs 


In  some  settings  it  may  be  more  convenient  to  reason  about  the  factor  graph  corresponding  to  the  factorized 
distribution.  A  factor  graph  is  an  undirected  bipartite  graph  where  the  vertices  V  =  X  U  {/A}Arj- 
correspond  to  all  the  variables  X  on  one  side  and  all  the  factors  {./a }Agjr  on  the  other  side.  Undirected 
edges  E  =  { { /A ,  Xt }  :  i  G  A,  A  G  F)  connect  factors  with  the  variables  in  their  domain.  To  simplify 
notation,  we  will  use  f  ,  G  U  and  Xj  G  V  to  refer  to  vertices  in  the  factor  when  we  wish  to  distinguish 
between  factors  and  variables,  and  i,  j  G  V  otherwise.  As  with  the  MRF  we  define  A/j  C  V  as  the 
neighbors  of  %  G  V  in  the  factor  graph.  In  Figure  2.1e  we  illustrate  the  factor  graph  for  the  image  denoising 
problem. 


2.1.3  Inference 

There  arc  three  primary  operations  within  graphical  models:  inference,  parameter  learning,  and  structure 
learning.  Given  the  structure  and  parameters  of  the  model,  inference  is  the  process  of  computing  or 
estimating  marginals: 


P(Xs|0)=  £  p(xs>x{i,...,n}\sl0)  (2-14) 

x{l,...,n}\5 

Estimating  marginal  distributions  is  an  essential  step  in  probabilistic  reasoning  and  is  needed  to  construct 
conditionals  and  posteriors,  make  predictions,  and  evaluate  the  partition  function  Z(6).  For  example,  by 
constructing  the  marginals  of  a  factor  graph  representing  the  interactions  between  protein  side  chains  we 
can  compute  the  likely  orientations  of  each  side  chain  and  bound  the  energy  of  a  particular  configuration. 
Alternatively,  the  marginals  can  be  used  in  various  learning  algorithms  to  optimize  model  parameters. 

Unfortunately,  exact  inference  is  NP-hard  in  general  [Cooper,  1990]  and  even  computing  bounded  approxi¬ 
mations  is  known  to  be  NP-hard  [Roth,  1993].  Since  exact  inference  is  not  tractable  in  most  large-scale 
real-world  graphical  models  we  typically  resort  to  approximate  inference  algorithms.  Fortunately,  there  are 
many  of  approximate  inference  algorithms  which  typically  perform  well  in  practice.  In  addition  we  can 
often  use  the  approximations  made  by  approximate  inference  algorithms  to  introduce  additional  parallelism. 
In  some  cases  (e.g..  Section  3.3)  we  can  introduce  new  approximations  that  arc  tailored  to  the  objective  of 
increasing  parallelism  with  a  bounded  impact  on  the  resulting  error  in  inference. 
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2.1.4  Parameter  Learning 


Parameter  learning  in  graphical  models  is  used  to  choose  the  value  of  the  parameters  9,  usually  by  fitting 
the  model  to  data.  The  parameters  may  correspond  to  individual  entries  in  a  probability  table  or  the  weights 
in  functions  (factors)  over  sets  of  variables.  Often  many  factors  will  depend  on  a  single  parameter  (shared 
parameter).  In  the  image  denoising  task  their  are  only  two  parameters  which  are  shared  across  all  the 
factors  and  act  as  weights. 

Parameter  learning  is  typically  accomplished  by  maximizing  the  log-likelihood  of  the  data: 

d 

1?mle  =  arg  max  log  P  |  o]  (2.15) 

i= 1 
d 

arg  max  EE  log/A^  I  9)  -  dlog  En  /a(xa  |  9)  (2.16) 

i= 1  AeT  X  AeJ7 

Unfortunately,  while  the  left  term  (unnormalized  log-likelihood)  in  Eq.  (2.16)  is  tractable  to  compute,  the 
right  term  (log-partition  function)  is  intractable  and  must  be  computed  through  approximate  inference. 
Therefore  by  focusing  on  improving  the  scalability  of  inference  we  can  also  improve  the  scalability  of 
parameter  learning. 

In  many  applications  parameter  learning  is  not  necessary.  In  a  purely  Bayesian  formulation,  priors  are 
placed  on  parameters  and  the  actual  parameter  values  are  treated  as  variables  which  are  eliminated  through 
marginalization  during  inference.  Often  domain  knowledge,  like  the  physical  energy  between  interacting 
molecules,  is  used  to  select  parameter  values. 

2.1.5  Structure  Learning 

The  final  and  typically  most  challenging  task  in  graphical  models,  is  structure  learning.  Here  the  structure 
of  the  MRF,  or  the  model  factorization  encoded  by  J7,  is  learned  from  data.  Structure  learning  is  most 
challenging  because  it  adds  a  combinatorial  search  on-top  of  the  already  intractable  problems  of  learning 
and  inference.  Moreover,  in  order  to  apply  structure  learning  we  need  a  method  to  impose  factors  over 
the  maximal  cliques  in  the  MRF  and  then  the  ability  to  learn  optimal  values  for  the  factor  parameters  9 
given  the  structure.  Therefore,  structure  learning  requires  parameter  learning  and  consequently  inference  as 
nested  subroutines. 

In  many  real-world  applications  of  graphical  models,  the  structure  is  typically  predetermined  from  prior 
knowledge.  For  example,  in  many  vision  applications  the  structure  is  a  regular  grid  or  mesh  connecting 
nearby  pixels  and  objects.  Alternatively,  in  language  modeling  applications  the  structure  is  determined  by 
the  presence  of  words  in  documents.  Even  in  our  social  network  example,  the  graphical  model  structure  is 
derived  from  the  friend  network.  Therefore,  in  this  thesis  we  will  assume  that  the  structure  and  parameters 
of  the  model  are  given,  however  we  believe  that  the  search  for  scalable  parameters  and  substructures  is  an 
exciting  direction  for  future  research. 

2.1.6  Summary  of  Important  Graphical  Models  Properties  in  the  Parallel  Setting 

Graphical  models  provide  a  standardized  representation  of  complex  probabilistic  relationships  enabling  the 
design  and  implementation  of  general  puipose  probabilistic  reasoning  algorithms.  Moreover,  by  providing 
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a  small  suite  of  efficient  parallel  algorithms  we  can  address  a  wide  range  of  applications.  In  this  thesis  we 
focus  on  inference  algorithms  which  are  the  key  computational  bottleneck  in  both  learning  an  applying 
graphical  models. 

The  same  sparse  graph  structure  which  enables  graphical  models  to  efficiently  encode  large  probability 
distributions  also  exposes  the  fundamental  dependencies  in  computation  and  present  an  opportunity  to 
leverage  the  wealth  of  existing  work  in  parallel  graph  algorithms.  For  example  our  work  in  Gibbs  sampling 
demonstrates  that  we  can  exploit  the  factorization  of  single  variable  Gibbs  updates  along  with  classical 
ideas  from  job  scheduling  on  a  dependency  graph  to  construct  a  parallel  Gibbs  sampler  that  is  provably 
correct  and  has  theoretically  quantifiable  mixing  acceleration. 

The  inherent  complexity  of  learning  and  inference  in  large  graphical  models  often  necessitates  approxima¬ 
tions.  By  restructuring  or  simply  reinterpreting  the  approximations  made  by  efficient  sequential  algorithms 
we  have  considerable  latitude  to  introduce  additional  parallelism.  For  example,  in  Chapter  3  we  show  how 
approximations  can  introduce  substantial  parallel  gains. 

To  conclude  this  section  we  briefly  summarize  several  of  the  key  properties  of  probabilistic  graphical 
models  that  will  enable  us  to  achieve  the  proposed  objectives: 

•  Graphical  models  provide  a  standardized  representation  enabling  the  design  and  implementation  of 
a  small  core  set  of  parallel  algorithms  which  span  a  wide  range  of  real-world  applications. 

•  Efficient  graphical  models  have  a  sparse  dependency  structure  which  facilitates  efficient  factorized 
(local)  parallel  computation. 

•  The  explicit  graph  structure  of  graphical  models  enables  us  to  leverage  classic  work  in  parallel  graph 
algorithms  and  job  scheduling. 

•  The  inherent  complexity  of  learning  and  inference  in  large  graphical  models  often  necessitates 
approximations  which  can  be  used  to  expose  substantial  additional  parallelism. 

2.2  Parallel  and  Distributed  Systems 

Unlike  the  sequential  setting  where  most  processor  designs  faithfully  execute  a  single  random  access 
computational  model,  the  parallel  setting  has  produced  a  wide  variety  of  parallel  architectures.  In  this  work 
we  focus  on  Multiple  Instruction  Multiple  Data  (MIMD)  parallel  architectures  which  includes  multicore 
processors  and  distributed  clusters.  The  MIMD  architecture  permits  each  processor  to  asynchronously 
execute  its  own  sequence  of  instruction  (MI)  and  operate  on  separate  data  (MD).  Furthermore  we  divide 
the  MIMD  architectures  into  Shared  Memory  (SM-MIMD)  represented  by  multi-core  processors  and 
Distributed  Memory  (DM-MIMD)  represented  by  large  commodity  clusters  and  cloud  systems.  This 
section  will  briefly  review  the  design  and  unique  opportunities  and  challenges  afforded  by  shared  and 
distributed  memory  MIMD  systems  in  the  context  of  parallel  machine  learning. 


2.2.1  Shared  Memory 

The  shared-memory  architectures  arc  a  natural  generalization  of  the  standard  sequential  processor  to  the 
parallel  setting.  In  this  work  we  assume  that  each  processor  is  identical  and  has  symmetric  access  to  a 
single  shared  memory.  All  inter-processor  communication  is  accomplished  through  shared  memory  and 


17 


(a)  Shared  Memory 


(b)  Distributed  Memory 


Figure  2.2:  The  two  common  parallel  architectures  encountered  in  typical  commodity  hardware,  (a)  The  standard  shared 
memory  architectural  model  consists  of  a  collection  of  processors  that  have  low  latency  high  bandwidth  access  to  a  single  uniform 
address  space.  All  inter-processor  communication  is  accomplished  through  the  shared  memory,  (b)  The  standard  distributed 
memory  architectural  model  consists  of  a  collection  of  processors  each  with  their  own  local  memory  which  are  able  to  communicate 
by  passing  messages  across  a  reliable  but  typically  slow  communication  network. 


is  typically  very  fast  (on  the  order  of  a  memory  read).  Consequently,  shared  memory  architectures  are 
particularly  convenient  because  they  do  not  require  partitioning  the  algorithm  state  and  enable  frequent 
interaction  between  processors. 

The  class  of  parallel  architectures  that  implement  symmetric  access  to  shared  memory  are  commonly 
referred  to  as  symmetric  multi-processors  (SMP).  Most  modern  SMP  systems  are  composed  of  one  or  more 
multi-core  processors.  Each  multi-core  chip  is  itself  composed  of  several  processors  typically  sharing  a 
local  cache.  These  systems  must  therefore  implement  sophisticated  mechanisms  to  ensure  cache  coherency, 
limiting  the  potential  for  parallel  scaling.  Nonetheless,  moderately  sized  mutli-core/SMP  can  be  found  in 
most  modem  computer  systems  and  even  in  many  embedded  systems.  It  is  expected  that  the  number  of 
cores  in  standard  SMP  systems  will  continue  to  increase  [Asanovic  et  al.,  2006]. 

While  multi-core  systems  offer  the  convenience  of  shared  memory  parallelism  they  introduce  challenges 
in  data  intensive  computing.  By  dividing  the  already  limited  bandwidth  to  main  memory  over  several 
processors,  multi-core  systems  quickly  become  memory  bandwidth  constrained.  Therefore,  efficient 
multi-core  algorithms  need  to  caref  ully  manage  memory  access  patterns. 


2.2.2  Distributed  Memory  and  Cluster  Computing 

Distributed  memory  systems  arc  found  in  systems  ranging  from  small  clusters  to  elastic  cloud  computing 
services  like  EC2.  Distributed  memory  systems  are  composed  of  a  collection  of  independent  processors  each 
with  its  own  local  memory  and  connected  by  a  relatively  slow  communication  network.  Distributed  memory 
algorithms  must  therefore  address  the  additional  costs  associated  with  communication  and  distributed 
coordination.  While  there  are  many  performance  and  connectivity  models  for  distributed  computing 
[Bertsekas  and  Tsitsiklis,  1989],  most  provide  point-to-point  routing,  a  maximum  latency  which  scales 
with  the  number  of  processors,  and  a  limited  bandwidth. 

Unlike  the  shared  memory  setting,  the  distributed  setting  often  considers  systems  with  network  and 
processor  failures.  As  clusters  become  larger,  rare  failure  modes  become  more  common  and  the  chances 
of  machine  and  network  failure  increases.  The  ability  to  recover  from  and  in  many  cases  tolerate  limited 
system  failure  has  been  a  focus  of  recent  work  in  distributed  computing  systems  and  cloud  environments. 
For  many  of  the  moderately  sized  clusters  used  in  research  some  of  the  emphasis  on  fault-tolerance  may 
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be  unnecessary  (see  Section  5.7.3).  In  our  work  on  graphical  model  inference  algorithms  we  will  initially 
assume  that  all  resources  remain  available  throughout  execution  and  that  all  messages  eventually  reach 
their  destination.  We  then  relax  this  assumption  in  the  GraphLab  system  by  introducing  relatively  simple 
check-point  and  recover  fault-tolerance  model. 

Typical  distributed  memory  systems  range  from  hundreds  of  processors  to  thousands  of  processors.  These 
system  employ  fast  commodity  networks  with  switch  hierarchies  or  highly  tuned  system  specific  inter¬ 
connects.  Often  distributed  memory  systems  arc  composed  of  smaller  shared  memory  systems,  typically 
having  several  multi-core  processors  at  each  node.  Here  we  will  use  the  term  node  to  refer  to  a  single 
machine  on  the  network  and  we  will  use  the  term  processor  to  refer  each  processing  element.  For  instance, 
a  network  of  8  machines,  each  with  2  quad-core  CPUs,  has  8  nodes,  64  processors. 

By  eliminating  the  need  for  cache  coherency,  the  distributed  memory  model  permits  the  construction  of 
substantially  larger  systems  and  also  offers  several  key  advantages  to  highly  data  intensive  algorithms. 
Because  each  processor  has  its  own  memory  and  dedicated  bus,  the  distributed  memory  setting  provides 
linear  scaling  in  memory  capacity  and  memory  bandwidth.  In  heavily  data  intensive  algorithms  linear 
scaling  in  memory  can  provide  substantial  performance  gains  by  eliminating  memory  bottlenecks. 

Because  of  the  latency  associated  with  network  communication  and  the  limited  network  bandwidth,  efficient 
algorithms  in  the  distributed  memory  model  must  minimize  inter-processor  communication.  Consequently, 
efficient  distributed  memory  algorithms  must  deal  with  the  challenges  associated  with  partitioning  both  the 
state  and  execution  in  a  way  that  minimizes  network  congestion  while  still  ensuring  that  no  one  processor 
has  disproportionally  greater  work. 
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Chapter  3 


Probabilistic  Inference:  Belief 
Propagation 


In  this  chapter  we  evaluate  the  thesis  statement  in  the  context  of  the  widely  used  [Baron  et  al.,  2010, 
Jaimovich  et  al.,  2006,  Lan  et  al.,  2006,  Singla  and  Domingos,  2008,  Sun  et  al.,  2002]  Loopy  Belief 
Propagation  (BP)  approximate  inference  algorithm.  Many  [Kang  et  al.,  2010,  Mendiburu  et  al.,  2007, 
Sun  et  al.,  2002]  have  proposed  synchronous  parallelizations  of  the  Loopy  BP  algorithm.  We  will  show 
that  synchronous  parallelization  of  Loopy  BP  is  highly  inefficient  and  propose  a  series  of  more  efficient 
parallelizations  by  adhering  to  the  design  methodology  described  within  the  thesis  statement.  In  the  process 
we  develop  the  SplashBP  algorithm  which  combines  new  scheduling  ideas  to  address  the  limitations  of 
existing  sequential  BP  algorithms  and  achieves  theoretically  optimal  performance.  Finally,  we  present  how 
to  efficiently  implement  BP  algorithms  in  the  distributed  setting  by  addressing  the  challenges  of  distributed 
state  and  load-balancing.  We  provide  both  theoretical  and  real-world  experimental  analysis  along  with 
implementation  specific  details  addressing  locking  and  efficient  data-structures.  In  addition,  C++  code  for 
the  algorithms  and  experiments  presented  in  this  chapter  can  be  obtained  from  our  online  repository  at 
http ://gonzalezlab s/the sis. 


3.1  Belief  Propagation 

A  core  operation  in  probabilistic  reasoning  is  inference  (see  Section  2.1.3)  —  the  process  of  computing  the 
probability  of  an  event.  While  inference  is  NP-hard  [Cooper,  1990]  in  general,  there  are  several  popular 
approximate  inference  algorithms  which  typically  perform  well  in  practice.  Unfortunately,  approximate 
inference  is  still  computationally  intensive  [Roth,  1993]  and  therefore  can  benefit  from  parallelization.  In 
this  chapter  we  parallelize  Loopy  Belief  Propagation  (or  BP  in  short)  which  is  used  in  a  wide  range  of 
ML  applications  [Baron  et  al.,  2010,  Jaimovich  et  al.,  2006,  Lan  et  al.,  2006,  Singla  and  Domingos,  2008, 
Sun  et  al.,  2002],  For  a  more  detailed  review  of  the  BP  and  it  generalizations  we  recommend  reading  the 
tutorial  by  Yedidia  et  al.  [2003]. 

Belief  propagation  (BP),  or  the  Sum-Product  algorithm,  was  introduced  by  Pearl  [1988]  as  an  exact 
inference  procedure  for  tree-structured  models  but  was  later  popularized  as  Loopy  Belief  Propagation,  an 
approximate  inference  algorithm  for  models  with  cycles  or  “loops.”  However,  for  simplicity  we  will  refer 
to  Loopy  Belief  Propagation  as  BP. 
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Figure  3.1 :  The  solid  message  leaving  the  solid  center  vertex  depends  only  on  the  shaded  messages  entering 
the  center  vertex. 


Computation  in  the  BP  algorithm  factorizes  over  the  graphical  model  structure.  The  BP  algorithm  associates 
a  vector  of  parameters  called  a  message1  along  each  direction  of  every  edge  in  the  factor  graph.  The 
messages  my4/A  and  rrifA^x,  along  the  edge  connecting  variable  X,  to  factor  /a  are  positive  functions 
(tables)  mapping  assignments  to  X,;  to  the  positive  reals.  In  loopy  graphs,  it  is  difficult  to  provide  a  direct 
probabilistic  interpretation  of  the  messages.  While  the  messages  technically  correspond  to  Langrange 
multipliers  in  the  Bethe  approximation,  it  may  help  to  think  of  the  message  rrifA^xt  as  encoding  a 
distribution  over  the  variable  Xt. 

Starting  with  messages  encoding  uniform  distributions,  the  BP  algorithm  iteratively  recomputes  (updates) 
messages  using  the  following  update  (fixed-point)  equations: 

mXi-+fA(xi)  oc  JJ  mh^Xi(xi)  (3.1) 

/bSAXA/a 

mfx-+Xi{xi)  oc  /a(xa)  n  mXk-+fA(xk)  (3.2) 

\-Ei  X h  —  \X  i 


where  Eq.  (3.1)  is  the  message  sent  from  variable  X,  to  factor  /a  and  Eq.  (3.2)  is  the  message  sent  from 
factor  /a  to  variable  Xt.  The  sum,  Yl^A\Xi’ 1S  computed  over  all  assignments  to  xa  excluding  the  variable 
X{,  and  the  product,  Y\Xk&Nf  \x- >  *s  computed  over  all  neighbors  of  the  vertex  /a  excluding  vertex  X,t.  To 
ensure  numerical  stability  the  messages  arc  typically  normalized. 

A  critical  property  of  the  BP  update  equations  (Eq.  (3.1)  and  Eq.  (3.2))  is  that  they  only  depend  on  the 
inbound  messages  associated  with  edges  adjacent  to  the  corresponding  vertex  (see  Figure  3.1).  As  a 
consequence  the  BP  update  equations  demonstrate  the,  factorized  locality  structure  needed  for  efficient 
parallel  and  distributed  computation.  In  addition  we  will  also  exploit  the  property  that  the  outbound 
message  does  not  depend  on  the  inbound  message  along  the  same  edge. 

The  update  equations  are  iteratively  applied  until  a  convergence  condition  is  met.  Typically,  convergence  is 

'Even  though  “message”  is  the  traditional  terminology  and  the  words  “send”  and  “receive”  are  frequently  used,  it  should  not 
be  interpreted  as  a  form  of  communication.  The  "message”  is  simply  a  numeric  vector  used  in  the  derivation  of  the  loopy  BP 
algorithm  in  Pearl  [1988]. 
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defined  in  terms  of  a  bound  on  the  change  in  message  values  between  iterations: 


max 

(ij)eE 


(new)  (old) 


</?• 


(3.3) 


for  a  small  constant  f3  >  0.  Unfortunately,  in  cyclic  graphical  models  there  are  few  convergence  guarantees 
[Ihler  et  al.,  2005,  Mooij  and  Kappen,  2007,  Tatikonda  and  Jordan,  2002]. 

At  convergence,  the  variable  and  clique  marginals,  also  called  the  beliefs  (6)  are  estimated  using: 


P  (Xt  =  Xi) 


P  (XA  =  xA) 


bXi{xi)o c  Yl  mh^x,(xi) 

/agAOc. 

&xa(xa)  oc  /a(xa)  n  mXj^fA{xj). 


(3.4) 

(3.5) 


In  other  words,  the  approximate  marginal  distribution  of  a  variable  is  simply  the  (normalized)  product  of 
all  of  its  incoming  messages.  Similarly,  the  approximate  marginal  of  all  variables  within  a  factor  can  be 
estimated  by  multiplying  the  factor  with  all  of  its  incoming  messages. 


3.1.1  The  Message  Schedule 

The  order  in  which  messages  are  updated  is  called  the  schedule  and  plays  a  central  role  in  efficient  BP. 
For  instance,  in  tree  graphical  models,  a  simple  procedure  known  as  the  forward-backward  schedule  was 
shown  by  Pearl  [1988]  to  yield  exact  marginals  using  0(2  \  E\  )  message  calculations.  First,  messages  arc 
computed  stalling  from  the  leaves  and  in  the  direction  of  an  arbitrarily  chosen  root.  The  process  is  then 
reversed  computing  all  the  messages  in  the  opposite  direction.  The  reader  may  observe  from  Eq.  (3.1)  and 
Eq.  (3.2)  that  this  forward-backward  (or  perhaps  upward-downward)  schedule  achieves  exact  convergence. 
That  is,  re-evaluation  of  any  message  using  the  message  update  equations,  will  not  change  the  message 
values. 

Unfortunately,  choosing  the  best  schedule  on  loopy  graphs  is  often  difficult  and  can  depend  heavily  on 
the  factor  graph  structure  and  even  the  model  parameters.  For  simplicity,  many  applications  of  loopy  BP 
adopt  a  synchronous  schedule  in  which  all  messages  arc  simultaneously  updated  using  messages  from  the 
previous  iteration.  Alternatively,  in  some  cases  an  asynchronous  schedule  is  employed,  in  which  messages 
arc  updated  sequentially  using  the  most  recent  inbound  messages.  For  example,  the  popular  round-robin 
asynchronous  schedule,  sequentially  updates  the  messages  in  fixed  order  which  is  typically  a  random 
permutation  over  the  vertices.  Advances  by  Elidan  et  al.  [2006]  and  Ranganathan  et  al.  [2007]  have  focused 
on  dynamic  asynchronous  schedules,  in  which  the  message  update  order  is  determined  as  the  algorithm 
proceeds.  Other  work  by  Wainwright  et  al.  [2001]  focus  on  tree  structured  schedules,  in  which  messages 
are  updated  along  collections  of  spanning  trees. 

In  this  chapter  we  demonstrate  that  by  varying  the  BP  schedule  we  can  affect  the  efficiency,  speed,  and 
parallelism  of  BP.  By  applying  the  GrAD  design  methodology  we  develop  computationally  efficient  parallel 
schedules  that  simultaneously  focus  computation  and  respect  the  sequential  structure  of  the  underlying 
graphical  model  while  exposing  a  high-degree  of  parallelism.  In  the  process  we  unify  and  refine  the  work 
on  dynamic  scheduling  [Elidan  et  al.,  2006]  and  tree  scheduling  [Wainwright  et  al.,  2001]  to  achieve 
state-of-the-art  performance  in  the  sequential,  parallel,  and  distributed  settings. 
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Algorithm  3.1:  Synchronous  BP 

Input:  Graph  G  =  (V,  E)  and  all  messages  { :  V(i,  j)  6  G} 
while  not  converged  do 

//  Embarrassingly  Parallel  Computation  Step 
forall  j  €  Afv  in  the  neighbors  of  v  do  in  parallel 

Compute  Message  using  {  } 

L  -'l  l  ieAfv 

//  Synchronization  Phase 
Swap(m/old\  ?n(new^) 


3.1.2  Opportunities  for  Parallelism  in  Belief  Propagation 

Because  BP  update  equation  factor  the  inference  computation  over  the  graph  they  exposes  substantial 
graph-level  parallelism.  Moreover,  all  the  messages  in  the  graph  could  be  computed  simultaneously 
and  in  parallel:  leading  to  the  Synchronous  BP  algorithm  which  we  will  present  next  in  Section  3.2.  In 
addition  BP  exposes  opportunities  for  more  fine-grained  factor-level  parallelism  through  the  individual 
message  calculations  (sums  and  products)  which  can  be  expressed  as  matrix  operations  which  can  be 
parallel izcd  relatively  easily  [Bertsekas  and  Tsitsiklis,  1989].  For  typical  message  sizes  where  the  number 
of  assignments  is  much  less  than  the  number  of  vertices  (X%  <<  n),  graph  level  parallelism  provides  more 
potential  gains  than  factor  level  parallelism.  For  instance,  the  message  update  equations  do  not  offer  much 
parallelism  if  all  variables  are  binary.  Therefore,  we  will  ignore  factor  level  parallelism  in  this  chapter  and 
instead  focus  on  graph  level  parallelism.  Running  time  will  be  measured  in  terms  of  the  number  of  message 
computations,  treating  individual  message  updates  as  atomic  unit  time  operations. 

Pennock  [1998]  and  Xia  et  al.  [2009]  explore  parallel  transformations  of  the  graph  which  go  beyond 
the  standard  message  calculations.  While  these  techniques  expose  additional  parallelism  they  require 
low  tree-width  models  and  tractable  exact  inference,  which  are  unreasonably  strong  requirements  when 
operating  on  massive  real-world  distributions.  In  acyclic  models  these  techniques  can  achieve  running  times 
that  are  logarithmic  in  the  tree  height  (as  apposed  to  the  linear  running  times  presented  in  this  chapter)  but 
require  considerable  fine  grained  parallelism.  In  this  chapter  we  restrict  our  attention  to  message  passing 
algorithms  on  the  original  graph  and  do  not  require  low  tree-width  models. 


3.2  Synchronous  BP 


Running  BP  with  a  synchronous  message  schedule  leads  to  the  highly  parallel  Synchronous  BP  algorithm 
shown  in  Alg.  3.1.  In  fact  Synchronous  BP  perfectly  fits  the  classic  Bulk  Synchronous  Parallel  (BSP) 
model  of  computation  (described  in  greater  detail  in  Section  5.3).  The  BSP  model  divides  computation  into 
super-steps  each  consisting  of  two  phases.  During  the  first  phase,  each  processor  executes  independently 
updating  its  local  BP  messages.  In  the  second  phase,  all  processors  exchange  BP  messages.  Finally, 
a  barrier  synchronization  is  performed  before  the  next  iteration,  ensuring  all  processors  compute  and 
communicate  in  lockstep. 

In  the  shared  memory  setting  messages  are  exchanged  by  swapping  between  old  and  new  message  buffers  at 
the  end  of  each  iteration.  In  the  distributed  memory  setting  messages  must  be  sent  over  the  communication 
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Algorithm  3.2:  Map  Function  for  Synchronous  BP 
Input:  A  vertex  v  €  V  and  all  inbound  messages  {rrii^v}iejy 
Output:  Set  of  outbound  messages  as  key-value  pairs  (j.  mv^j ) 
forall  j  G  Mv  in  the  neighbors  of  v  do  in  parallel 
Compute  Message  rriv->j  using 
return  key-value  pair  (j,  mv->j) 


Algorithm  3.3:  Reduce  Function  for  Synchronous  BP 
Input:  The  key-value  pairs  {(n, 

Output:  The  belief  bv  for  vertex  v  as  well  as  the  {  (v,  pairs  (j,  mv^j) 

Compute  the  belief  bv  for  v  using  { (v,  mi^v)}i&J^v 
Return  bv  and  {(n,  mj^)}i£Ar# 


network.  In  either  case  Synchronous  BP  requires  two  copies  of  each  message  to  be  maintained  at  all 
times.  When  the  messages  arc  updated  their  new  the  values  in  m!old)  arc  used  as  input,  while  the  resultant 
values  arc  stored  in  m!  newk  In  addition,  the  task  of  assessing  convergence  is  relatively  straight  forward  for 
Synchronous  BP.  After  each  iteration  all  processors  (or  machines  in  the  distributed  setting)  vote  to  halt 
based  on  whether  all  the  messages  they  have  compute  satisfy  the  termination  condition  (Eq.  (3.3)). 


3.2.1  Synchronous  BP  in  MapReduce 

The  Synchronous  BP  algorithm  may  also  be  expressed  in  the  context  of  the  popular  Map-Reduce  framework. 
The  Map-Reduce  framework,  introduced  by  Dean  and  Ghemawat  [2004] ,  simplifies  designing  and  and 
implementation  large  scale  parallel  algorithms  and  has  therefore  been  widely  adopted  by  the  data-mining 
and  machine  learning  community  [Chu  et  al.,  2006].  Synchronous  BP  can  naturally  be  expressed  as  an 
iterative  MapReduce  algorithm  where  the  Map  operation  (defined  in  Alg.  3.2)  is  applied  to  all  vertices  and 
emits  destination-message  key-value  pairs  and  the  Reduce  operation  (defined  in  Alg.  3.3)  joins  messages 
at  their  destination  vertex,  updates  the  local  belief,  and  prepares  for  the  next  iteration.  It  is  important 
to  note  that  the  MapReduce  abstraction  was  not  originally  designed  for  iterative  algorithms,  like  belief 
propagation,  and  therefore  standard  implementations,  like  Hadoop,  incur  a  costly  communication  and  disk 
access  penalty  between  iterations. 


3.2.2  Synchronous  BP  Runtime  Analysis 


While  it  is  not  possible  to  analyze  the  running  time  of  Synchronous  BP  on  a  general  cyclic  graphical  model, 
we  can  analyze  the  running  time  in  the  context  of  tree  graphical  models.  In  Theorem  3.2.1  we  characterize 
the  running  time  of  Synchronous  BP  when  computing  exact  marginals  (with  3  =  0). 

Theorem  3.2.1  (Exact  Synchronous  BP  Running  Time).  Given  an  acyclic  factor  graph  with  n  vertices, 
longest  path  length  l,  and  p  <  2  (n  —  1)  processors,  parallel  synchronous  belief  propagation  will  compute 
exact  marginals  in  time  (as  measured  in  number  of  vertex  updates): 


0 
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Proof  of  Theorem  3.2.1.  The  proof  follows  naturally  from  the  optimality  of  the  standard  forward-backward 
scheduling  on  tree-structured  models.  On  the  kth  iteration  of  Synchronous  BP,  all  messages  have  traveled  a 
distance  of  k  in  all  directions.  We  know  from  the  forward-backward  algorithm  that  the  furthest  distance 
we  must  push  messages  is  l.  Therefore  it  will  take  at  most  l  iterations  for  Synchronous  BP  to  obtain  exact 
marginals. 

Each  iteration  requires  2(n  —  1)  message  calculations,  which  we  divide  evenly  over  p  processors.  Therefore 
it  takes  [2  (n  —  1)  /p]  time  to  complete  a  single  iteration.  Thus  the  total  running  time  is: 


2  (n  —  1) 

P 


<  l 


2  (n  -  1) 


P 


□ 

If  we  consider  the  running  time  given  by  Theorem  3.2.1  we  see  that  the  n/p  term  corresponds  to  the 
parallelization  of  each  synchronous  update.  The  length  l  of  the  longest  path  corresponds  to  the  limiting 
sequential  component  which  cannot  be  eliminated  by  scaling  the  number  of  processors.  As  long  as  the 
number  of  vertices  is  much  greater  than  the  number  of  processors,  the  Synchronous  BP  algorithm  achieves 
nearly  linear  parallel  scaling  and  therefore  appears  to  be  an  optimal  parallel  algorithm.  However,  an 
optimal  parallel  algorithm  should  also  be  efficient.  That  is,  the  total  work  done  by  all  processors  should  be 
asymptotically  equivalent  to  the  work  done  by  a  single  processor  running  the  optimal  sequential  algorithm. 
By  examining  the  simple  chain  graphical  model  we  will  show  that  there  are  O  (n)  unnecessary  parallel 
message  computations  at  each  iteration  leading  to  a  highly  suboptimal  parallel  algorithm. 


3.2.3  Analysis  of  Synchronous  Belief  Propagation  on  Chains 

To  illustrate  the  inefficiency  of  Synchronous  BP,  we  analyze  the  running  time  on  a  chain  graphical  model 
with  n  vertices.  Chain  graphical  models  act  as  a  theoretical  benchmark  by  directly  capturing  the  limiting 
sequential  structure  of  message  passing  algorithms  and  can  be  seen  as  a  sub-problem  in  both  acyclic  and 
cyclic  graphical  models.  Chain  graphical  models  also  represent  the  worst  case  acyclic  graphical  model 
structure  for  parallel  belief  propagation  because  they  contain  only  a  single  long  path  and  do  not  expose 
any  opportunities2  for  parallelism.  Conversely,  the  branching  structure  of  trees  permits  multiple  longest 
paths  and  exposes  an  opportunity  for  additional  parallelism  by  running  each  path  separately.  Even  in  cyclic 
graphical  models,  the  longest  paths  in  the  unrolled  computation  graph  again  reduces  to  a  chain  graphical 
model. 

It  is  well  known  that  the  forward-backward  schedule  (Figure  3.2a)  for  belief  propagation  on  chain  graphical 
models  is  optimal.  The  forward-backward  schedule,  as  the  name  implies,  sequentially  computes  messages 
from  77)1-5.2  to  mn_i_5.n  in  the  forward  direction  and  then  sequentially  computes  messages  from  mn^,n- 1 
to  77)  2-5.1  in  the  backward  direction.  The  running  time  of  this  simple  schedule  is  therefore  0  (n)  or  exactly 
2 (n  —  1)  message  calculations. 

If  we  run  the  Synchronous  BP  algorithm  using  p  =  2 (n  —  1)  processors  on  a  chain  graphical  model  of 
length  ?7,  we  obtain  a  running  time  of  exactly  n  —  1.  This  means  that  parallel  Synchronous  BP  algorithm 
obtains  only  a  factor  of  two  speedup  using  two  processors  per  edge,  almost  twice  as  many  processors  as  the 
number  of  vertices.  More  surprisingly,  if  we  use  fewer  than  n  —  1  processors,  the  parallel  synchronous 

2While  it  is  true  that  rake-compress  style  algorithms  as  described  by  Miller  and  Reif  [1985]  could  be  applied  to  chain  graphical 
models  they  do  not  generalize  to  cyclic  models. 
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(a)  Single  Sequential  (b)  Optimal  Parallel 


Figure  3.2:  (a)  The  optimal  forward-backwards  message  ordering  for  exact  inference  on  a  chain  using  a 
single  processor,  (b)  The  optimal  message  ordering  for  exact  inference  on  a  chain  using  two  processors. 


algorithm  will  be  slower  than  the  simple  sequential  forward-backward  algorithm  running  on  a  single 
processor.  Finally,  if  we  use  any  constant  number  of  processors  (for  example  p  =  2),  then  the  parallel 
Synchronous  BP  algorithm  will  run  in  quadratic  time  while  the  sequential  single  processor  algorithm  will 
run  in  linear  time. 

What  is  the  optimal  parallel  schedule  for  belief  propagation  on  chain  graphical  models?  Recall  that  in  the 
message  update  equations  (Eq.  (3.1)  and  Eq.  (3.2)),  the  message  does  not  depend  on  the  message 
rrij^i  (see  Figure  3.1).  In  the  sequential  forward-backward  algorithm,  messages  in  the  forward  sweep  do 
not  interact  with  messages  in  the  backward  sweep  and  therefore  we  can  compute  the  forward  and  backward 
sweep  in  parallel  as  shown  in  Figure  3.2b.  Using  only  p  =  2  processors,  simultaneously  computing 
messages  sequentially  in  the  forward  and  backward  directions  as  shown  in  Figure  3.2b,  we  obtain  a  running 
time  of  n  —  1  and  achieve  a.  factor  of  two  speedup. 

While  messages  may  be  computed  in  any  order,  information  is  propagated  sequentially.  On  every  iteration 
of  Synchronous  BP  only  a  few  message  computations  (in  the  case  of  chain  graphical  models  only  two 
message  computations)  contribute  to  convergence  while  the  rest  are  wasted.  Unfortunately,  there  arc  no 
parallel  schedules  which  achieve  greater  than  a  factor  of  two  speedup  for  exact  inference  on  arbitrary  chain 
graphical  models.  Fortunately,  there  are  parallel  schedules  which  can  achieve  substantially  better  scaling 
by  exploiting  a  frequently  used  approximation  in  loopy  BP. 


3.3  The  re  Approximation 


In  almost  all  applications  of  loopy  BP,  the  convergence  threshold  f3  in  Eq.  (3.3)  is  set  to  a  small  value 
greater  than  zero  and  the  algorithm  is  terminated  prior  to  reaching  the  true  fixed-point.  Even  when  /3  =  0, 
the  fixed  floating  point  precision  of  discrete  processors  result  in  early  termination.  Because  the  resulting 
approximation  plays  an  important  role  in  studying  parallel  belief  propagation,  we  provide  a  brief  theoretical 
characterization. 


We  can  represent  a  single  iteration  of  synchronous  belief  propagation  by  a  function  Ubp  which  maps  all  the 
messages  on  the  tlh  iteration  to  all  the  messages  m(,+ 1  j  =  UBp(m^)  on  the  (t  +  l)th  iteration.  The 
fixed  point  is  then  the  set  of  messages  m*  =  U\w{rn*)  that  arc  invariant  under  Ubp-  In  addition,  we  define 
a  max-norm  for  the  message  space 


OO 


max 


(t)  (t+1) 

>.j  ~  mUj 


(3.6) 


which  matches  the  norm  used  in  the  standard  termination  condition,  Eq.  (3.3). 
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Definition  3.3.1  (Definition  of  re-approximation).  We  define  t€  as: 


t€  =  min  t  s.  t. 


m 


(t) 


—  m 


< 


(3.7) 


the  number  of  synchronous  iterations  required  to  be  within  an  e  ball  of  the  BP  fixed-point.  Therefore 
a  Te- Approximation  is  the  approximation  obtained  from  running  synchronous  belief  propagation  for  Te 
iterations. 


A  common  method  for  analyzing  fixed  point  iterations  is  to  show  (assume)  that  the  (/bp  is  a  contraction 
mapping  and  then  use  the  contraction  rate  to  bound  the  number  of  iterations  for  an  e  level  approximation  of 
to*.  If  (/bp  is  a  max-norm  contraction  mapping  then  for  the  fixed  point  m*  and  0  <  a  <  1 . 


I|t/Bp( 


m)  —  m 


<  a  \  \m  —  rn 


Work  by  Mooij  and  Kappen  [2007]  provide  sufficient  conditions  for  (/bp  (to)  to  be  a  contraction  mapping 
under  a  variety  of  norms  including  the  max-norm  and  the  L] -norm  and  shows  that  BP  on  acyclic  graphs  is 
guaranteed  to  be  a  contraction  mapping  under  a  particular  spectral  norm. 

If  the  contraction  rate  a  is  known,  and  we  desire  an  e  approximation  of  the  fixed  point,  re  is  the  smallest 
value  such  that  aTe  ||mo  —  m*!^  <  e.  This  is  satisfied  by  setting 


logC 2/e) 
log(l/a) 


(3.8) 


Finally,  in  Eq.  (3.9)  we  observe  that  the  convergence  criterion,  ||m  —  /(m)||,x  defined  in  Eq.  (3.3),  is 
a  constant  factor  upper  bound  on  the  distance  between  to  and  the  fixed  point  to*.  If  we  desire  an  e 
approximation,  it  is  sufficient  to  set  the  convergence  criterion  (3  <  e(l  —  a). 


|TO  —  TO*!^  =  \\m  —  (Zbp(to)  +  (/bp  (TO)  —  TO*^ 

<  ||to  -  (7bp(to)||00  +  ||[/bp(to)  -  to* 

<  ||to  -  [7bp(to)||00  +  a  \\m  -  m*^ 

1 

—  1 1  m  —  /  /r>r»  ( m  )  1 1 

oo 


|  TO  —  TO  Hgjj  < 


1  —  a 


I  to.  -  [/Bp(m.)||c 


(3.9) 


In  practice,  the  contraction  rate  a  is  likely  to  be  unknown  and  Ub p  will  not  be  a  contraction  mapping  on 
all  graphical  models.  Furthermore,  it  may  be  difficult  to  determine  re  without  first  running  the  inference 
algorithm.  Ultimately,  our  results  only  rely  on  re  as  a  theoretical  tool  for  comparing  inference  algorithms 
and  understanding  parallel  convergence  behavior.  In  practice,  rather  than  constructing  a  (3  for  a  particular 
a  and  e,  we  advocate  the  more  pragmatic  method  of  picking  the  smallest  possible  value  that  resources 
permit. 


3.3.1  Graphical  Interpretation 

The  re -Approximation  also  has  a  graphical  interpretation.  Intuitively,  for  a  long  chain  graphical  model  with 
weak  (nearly  uniform)  edge  potentials,  distant  vertices  are  approximately  independent.  For  a  particular 
vertex,  an  accurate  approximation  to  the  belief  may  often  be  achieved  by  running  belief  propagation  on 
a  small  subgraph  around  that  vertex.  More  formally  to  achieve  an  e  level  approximation  for  a  particular 
message,  we  require  a  subgraph  which  contains  all  vertices  that  arc  reachable  in  rf  steps. 
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Figure  3.3:  Above  the  chain  graphical  model  we  plot  the  sequence  of  messages  from  vertex  1  to  vertex 
7.  For  visual  clarity  these  messages  are  plotted  as  continuous  distributions.  Below  the  chain  we  plot 
the  approximate  sequence  of  messages  stalling  at  vertex  4  assuming  a  uniform  message  is  received  from 
vertex  3.  We  see  that  the  message  that  finally  arrives  at  vertex  7  is  only  e  away  from  the  original  message. 
Therefore  for  an  e  level  approximation  at  vertex  7  messages  must  only  be  sent  a  distance  of  r  =  3. 


An  illustration  of  the  graphical  interpretation  is  given  in  Figure  3.3.  Suppose  in  the  sequential  forward  pass 
we  change  the  message  717,3-3.4  to  a  uniform  distribution  (or  equivalently  disconnect  vertex  3  and  4)  and 
then  proceed  to  compute  the  rest  of  the  messages  from  vertex  4  onwards.  Assuming  re  =  3  for  the  chosen 
e,  the  final  approximate  belief  at  67  would  have  less  than  e  error  in  some  metric  (here  we  will  use  Li-norm) 
to  the  true  probability  P  (XTe+\ ).  Vertices  re  apart  are  therefore  approximately  independent.  If  re  is  small, 
this  can  dramatically  reduces  the  length  of  the  sequential  component  of  the  algorithm. 


3.3.2  Empirical  Analysis  of  re 


The  77  structure  of  a  graphical  model  is  a  measure  of  both  the  underlying  chain  structure  as  well  as 
the  strength  of  the  factors.  The  strength  of  a  factor  refers  to  its  coupling  affect  on  the  variables  in  its 
domain.  A  factor  that  assigns  relatively  similar  probability  to  all  configurations  is  much  weaker  than  a 
factor  that  assigns  disproportionately  greater  probability  to  a  subset  of  its  assignments.  Graphical  models 
with  factors  that  tightly  couple  variables  along  chains  will  require  a  greater  r  to  achieve  the  same  e  level 
approximation. 

To  illustrate  the  dependence  r€  on  the  coupling  strength  of  the  factors  we  constructed  a  sequence  of  binary 
chain  graphical  models  each  with  1000  variables  but  with  varying  degrees  of  attractive  and  repulsive  edge 
factors.  We  parameterized  the  edge  factors  by 


fxiy 


e6  Xi  =  xi+i 

e1  otherwise 


(3.10) 


where  0  is  the  “strength”  parameter.  When  0  =  1,  every  variable  is  independent  and  we  expect  re  =  1 
resulting  in  the  shortest  runtime.  When  0  <  1  (corresponding  to  anti-ferromagnetic  potentials)  alternating 
assignments  become  more  favorable.  Conversely  when  0  >  1  (corresponding  to  ferromagnetic  potentials) 
matching  assignments  become  more  favorable.  We  constructed  16  chains  at  each  potential  with  vertex 
potentials  fXi  chosen  randomly  from  Unif(0, 1).  We  ran  synchronous  belief  propagation  until  a  fixed 
e  =  10-5  level  approximation  was  achieved  and  plotted  the  number  of  iterations  versus  the  potential 
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Potential  Strength  (9) 


(a)  re  versus  9 


(b)  re  versus  9 


Figure  3.4:  In  (a)  the  number  of  iterations  synchronous  BP  required  to  obtain  a  e  =  10-5  approximation 
on  a  1000  variable  chain  graphical  model  is  plotted  against  the  strength  of  the  pairwise  potential  9.  In  (b) 
the  log  average  message  error  is  plotted  as  a  function  of  the  walk  length  on  the  protein  pair-wise  markov 
random  fields  and  the  UW-Languages  MLN. 


strength  in  Figure  3.4a.  We  observe  that  even  with  relatively  strong  potentials,  the  number  of  iterations,  re, 
is  still  relatively  small  compared  to  the  length  of  the  chain. 

We  can  further  experimentally  characterize  the  decay  of  message  errors  as  a  function  of  propagation 
distance.  On  several  real  networks  we  ran  belief  propagation  to  convergence  (at  (3  =  10~10)  and  then 
simulated  the  effect  of  replacing  an  arbitrary  message  with  a  uniform  distribution  and  then  propagating  that 
error  along  random  paths  in  the  graph.  In  Figure  3.4b  we  plot  the  message  error  as  a  function  of  distance 
averaged  over  1000  random  walks  on  various  different  factor  graphs.  Walks  were  terminated  after  the  error 
fell  below  the  original  fj  =  10“ 10  termination  condition.  In  all  cases  we  see  that  the  affect  of  the  original 
message  error  decays  rapidly. 

The  concept  of  message  decay  in  graphical  models  is  not  strictly  novel.  In  the  setting  of  online  inference  in 
Russell  and  Norvig  [1995]  used  the  notion  of  Fixed  Lag  Smoothing  to  eliminate  the  need  to  pass  messages 
along  the  entire  chain.  Meanwhile,  work  by  Ihler  et  al.  [2005]  has  investigated  message  error  decay  along 
chains  as  it  relates  to  robust  inference  in  distributed  settings,  numerical  precision,  and  general  convergence 
guarantees  for  belief  propagation  in  cyclic  models.  Our  use  of  re  is  to  theoretically  characterize  the 
additional  parallelism  exposed  by  an  e  level  approximation  and  to  ultimate  guide  the  design  of  the  new 
optimal  parallel  belief  propagation  scheduling. 

The  re  -Approximation  enables  us  to  achieve  substantially  greater  parallelism  in  chain  models  with  only 
a  negligible  loss  in  accuracy.  The  definition  of  re  directly  leads  to  an  improved  parallel  running  time  for 
Synchronous  BP: 

Theorem  3.3.1  (re -Approximate  Synchronous  BP  Running  Time).  Given  an  acyclic  factor  graph  with  n 
vertices  and  p  <  2(n  —  1)  processors,  parallel  synchronous  belief  propagation  will  compute  re- approximate 
messages  in  time  (as  measured  in  number  of  vertex  updates): 


Proof  of  Theorem  3.3.1.  We  know  that  each  iteration  will  take  |"2(n  —  1  )/p]  time  to  complete.  From 


29 


Definition  3.3.1  it  follows  that  a  re-approximation  is  obtained  in  exactly  r€  iterations. 


□ 


The  important  consequence  of  Theorem  3.3. 1  is  that  the  fundamental  sequential  component  of  Synchronous 
BP  depends  on  the  effective  chain  length  determined  by  re.  Models  with  weaker  variable  interactions  will 
have  a  smaller  re  and  will  permit  greater  parallelism.  However,  even  with  the  reduced  runtime  afforded  by 
the  t€ -approximation,  we  can  show  that  on  a  simple  chain  graphical  model,  the  performance  of  Synchronous 
BP  is  still  far  from  optimal  when  computing  a  re  approximation.  We  formalize  this  intuition  through  the 
following  lower  bound  on  re-approximations  for  chain  graphical  models. 

Theorem  3.3.2  (re- Approximate  BP  Lower  Bound).  For  an  arbitrary  chain  graph  with  n  vertices  and  p 
processors,  a  lower  bound  for  the  running  time  of  a  Te-approximation  is: 


To  prove  Theorem  3.3.2  we  introduce  the  concept  of  awareness.  Intuitively,  awareness  captures  the  “flow” 
of  message  information  along  edges  in  the  graph.  If  messages  are  passed  sequentially  along  a  chain  of 
vertices  starting  at  vertex  i  and  terminating  at  vertex  j  then  vertex  j  is  aware  of  vertex  i. 

Definition  3.3.2  (Awareness).  Vertex  j  is  aware  of  vertex  i  if  there  exists  a  path  (tq, . . . ,  vf)  from  v\  =  i 
to  Vk  =  j  in  the  factor  graph  G  and  a  sequence  of  messages  [rriy^V2 , . . . ,  such  that  each 

message  was  computed  after  the  previous  message  in  the  sequence  t.\  <  ...  <  ti,:. 

Proof  of  Theorem  3.3.2.  Because  the  messages  sent  in  opposite  directions  are  independent  and  the  amount 
of  work  in  each  direction  is  symmetric,  we  can  reduce  the  problem  to  computing  a  re-approximation  in  one 
direction  (X\  to  Xn)  using  p/2  processors.  Furthermore,  to  achieve  a  re-approximation,  we  need  exactly 
n  —  re  vertices  from  {XTe+i, . . . ,  Xn}  to  be  re  left-aware  (i.e.,  for  all  i  >  re,  Xt  is  aware  of  Aj_Te).  By 
definition  when  n  —  re  vertices  first  become  re  left-aware  the  remaining  (first)  rt  vertices  are  maximally 
left-aware. 

Let  each  processor  compute  a  set  of  k  >  re  consecutive  message  updates  in  sequence  (e.g., 
[m\ 1  't 2 .  rri!//-,, . . . ,  mR  J).  Notice  that  it  is  never  beneficial  to  skip  a  message  or  compute  messages 
out  of  order  on  a  single  processor  since  doing  so  cannot  increase  the  number  of  vertices  made  left-aware. 
After  the  first  r€  updates  each  additional  message  computation  make  exactly  one  more  vertex  left-aware. 
Therefore  after  k  message  computations  each  processor  can  make  at  most  k  —  re  +  1  vertices  left-aware. 
Requiring  all  p/2  processors  to  act  simultaneously,  we  observe  that  pre-emption  will  only  decrease  the 
number  of  vertices  made  re  left-aware. 

We  then  want  to  find  a  lower  bound  on  k  such  that  the  number  of  vertices  made  left-aware  after  k  iterations 
greater  than  the  minimum  number  of  vertices  that  must  be  made  left-aware.  Hence  we  obtain  the  following 
inequality: 


n  —  Te  <  |(fe-re  +  1) 

2  n  (  2\ 

k  >  —  +re  1--  -1  (3.11) 

P  V  Pj 

relating  required  amount  of  work  and  the  maximum  amount  of  work  done  on  the  /rLh  iteration.  For  p  >  2, 
Eq.  (3. 1 1)  provides  the  desired  asymptotic  result.  □ 
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Algorithm  3.4:  Atomic  Parallel  Round-Robin  Algorithm 
a  <—  Arbitrary  permutation  on  {1, . . . ,  |Vj} 

Atomic  integer  i  <—  1 
do  in  parallel 

while  Not  Converged  do 

//  Reads  the  value  of  i  and  increments  i  atomically 
j  =  AtomicFetchAndIncrement(i) 

//  Get  the  vertex  to  update 
v  =  cr(j  mod  |Vj) 

//  Compute  out  messages  after  earlier  neighbors  complete. 
SendMessages (v) 


The  bound  provided  in  Theorem  3.3.2  is  surprisingly  revealing  because  it  concisely  isolates  the  fundamental 
parallel  (n/p)  and  sequential  (t€)  components  of  BP.  Furthermore,  as  we  scale  models,  holding  their 
parameters  fixed,  re  remains  constant  providing  optimal  parallel  scaling  albiet  with  rf  multiplicative 
inefficiency. 

In  the  next  few  sections  we  will  apply  the  GrAD  methodology  to  introduce  several  more  efficient  parallel 
BP  algorithms  that  will  attempt  to  address  this  multiplicative  dependence  of  nre/p  on  re  in  Theorem  3.3.1 
and  improve  upon  memory  efficiency  and  convergence. 


3.4  Round  Robin  Belief  Propagation 

In  contrast  to  the  Synchronous  BP  algorithm  in  which  updates  all  messages  simultaneously,  the  more 
efficient  forward-backward  algorithm  updates  messages  in  a  fixed  sequential  order.  Once  a  message  is 
updated  its  value  is  immediately  visible  to  any  future  message  calculations.  More  generally  given  any  fixed 
ordering  a  over  the  vertices,  the  inherently  sequential  Round-Robin  BP  algorithm  iteratively  sweeps  the 
vertices  in  the  order  a  recomputing  each  of  their  out  messages.  As  before,  once  a  message  is  computed  its 
new  value  is  used  in  all  message  calculation  for  vertices  later  in  a. 

We  can  obtain  a  parallel  Round-Robin  BP  algorithm  (Alg.  3.4)  by  choosing  a  particular  sweep  ordering 
<7  and  constructing  a  corresponding  parallel  execution.  Rather  than  one  processor  iteratively  updating 
messages  in  the  order  a  all  p  processors  asynchronously  update  p  vertices  at  a  time.  While  each  processor 
runs  independently,  the  next  vertex  v  =  <r(+  +  j  mod  |Vj)  is  obtained  by  atomically  advancing  the 
shared  schedule  a.  After  a  processor  gets  the  next  vertex  v  it  executes  SendMessages  (v)  and  checks 
for  local  convergence. 

We  introduce  a  locking  scheme  that  prevents  a  processor  from  updating  a  vertex  while  a  dependent  message 
from  an  earlier  vertex  is  being  processed.  The  SendMessages  ( v )  operation  (Alg.  3.5)  implements  this 
locking  scheme.  When  SendMessages  (v)  is  invoked  on  vertex  v  it  acquires  a  write  lock  on  v  and  read 
locks  on  all  adjacent  vertices.  As  a  consequence,  concurrent  invocations  of  SendMessages  on  adjacent 
vertices  are  serialized  ensuring  both  that  the  sequential  order  is  preserved  and  that  only  one  processor  is 
reading  or  writing  the  same  a  message  value  at  the  same  time.  To  prevent  deadlocks,  all  locks  are  acquired 
in  the  same  order  (a)  on  all  processors.  Once  all  the  adjacent  locks  have  been  acquired  all  out-bound 
messages  arc  updated  and  then  the  locks  arc  released. 
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Algorithm  3.5:  SendMessages(n) 

//  Sort  v  and  its  neighbors  (done  in  advance) 

o  •*—  Sort  (A fv  U  {u} )  ordered  by  a 

//  Lock  v  and  its  neighbors  in  order 

for  i  £  [1, . . . ,  |o|]  do 

if  a,  =  v  then  Acquire  Write  Lock  on  v 
else  Acquire  Read  Lock  on  neighbor  o; 

//  Compute  the  new  belief 
bv  oc 

if  v  is  a  factor  node  then  bv  oc  /,,  x  bv 
II  Compute  outgoing  messages 
for  u  £  Mv  do  ms-in  oc  ^  bv/mu^v 

X„\xu 

for  i  G  [1, . . . ,  |o|]  do 
j  Release  Lock  on  Oj 


The  locking  mechanism  used  in  SendMes  sages  (Alg.  3.5)  ensures  that  every  sweep  of  the  Round  Robin 
BP  algorithm  is  serializable:  there  is  a  corresponding  sequential  execution  It  is  possible  for  the 
serialization  <jW  to  differ  form  a  the  proposed  permutation.  However,  for  large  cyclic  models  where  a  is 
usually  a  random  permutation,  such  variation  is  not  typically  not  an  issue. 

In  the  shared  memory  setting  the  Round-Robin  BP  algorithm  uses  atomic  operations  (see  Section  2.2.1) 
which  are  typically  available  on  most  modern  processors  and  provide  the  coordination  needed  to  maintain 
the  shared  schedule.  In  the  distributed  setting  we  further  relax  serializability  constraints  by  partitioning  (see 
Section  3.8.1)  the  Round-Robin  schedule  across  machines  allowing  messages  along  the  boundary  between 
machines  to  be  computed  concurrently. 


3.5  Wildfire  Belief  Propagation 


While  Round  Robin  BP  tends  to  converge  faster  than  Synchronous  BP,  the  Round  Robin  BP  algorithm  can 
still  lead  to  a  significant  amount  of  unnecessary  computation.  For  example,  consider  a  graphical  model 
which  comprises  of  two  disconnected  regions.  The  first  region  takes  a  Round  Robin  sweeps  to  converge, 
while  the  second  region  takes  b  sweeps,  where  a  «  b.  Performing  full  Round  Robin  BP  sweeps  on  the 
entire  graph  will  therefore  take  b  iterations  to  converge.  The  first  region  will  run  for  b  —  a  more  iterations 
than  necessary  resulting  in  substantial  wasted  computation. 


A  simple  solution  to  the  problem  of  over  scheduling  an  already  converged  region  of  the  model  is  to 
simply  skip  vertices  that  have  converged.  The  resulting  Wildfire  algorithm  shown  in  Alg.  3.6  is  a  direct 
parallelization  of  the  algorithm  proposed  by  [Ranganathan  et  al.,  2007].  The  Wildfire  algorithm  is 
almost  identical  to  the  Round  Robin  algorithm  (Alg.  3.4)  except  that  only  messages  that  have  changed 
significantly: 


(new)  (old) 


>13 


(3.12) 


are  sent.  To  apply  this  definition  we  must  compute  all  the  new  outbound  messages  every  time  a  new 
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A 


B 


Figure  3.5:  If  Round  Robin  BP  is  used  on  this  graph,  the  left  paid  of  the  graph  (A)  will  converge  rapidly 
in  a  few  iterations,  while  the  right  paid  of  the  graph  (B)  may  take  much  longer  to  converge.  Running  full 
sweeps  therefore  wastes  computation. 


message  aixives.  However,  only  the  outbound  messages  that  have  changed  by  more  then  6  are  sent  and  the 
rest  are  discarded.  If  a  vertex  has  not  received  any  messages  since  the  last  Round  Robin  sweep  then  it  is 
not  necessary  to  recompute  its  outbound  messages. 


3.6  Residual  Belief  Propagation 


A  natural  extension  of  to  the  WildFire  BP  algorithm  is  to  prioritize  the  reception  of  the  most  influential 
messages  over  messages  that  have  not  changed  substantially.  Elidan  et  al.  [2006]  introduced  the  Residual 
BP  algorithm  which  prioritizes  messages  based  on  their  residual: 


(new)  (old) 


(3.13) 


which  is  analogous  to  the  skipping  condition  used  in  the  Wildfire  belief  propagation  algorithm.  The  residual 
prioritization  greedily  updates  the  message  that  most  violates  the  termination  condition.  In  settings  where 
belief  propagation  is  a  contraction  mapping,  Elidan  et  al.  [2006]  showed  that  the  largest  residual  is  an  upper 
bound  on  the  distance  to  fixed-point  and  therefore  the  residual  heuristic  is  greedily  minimizing  and  upper 
bound  on  the  underlying  BP  energy. 


Every  time  a  vertex  receives  one  message  it  actually  must  “re -receive”  all  messages  into  that  vertex. 
While  it  is  possible  to  “incrementally”  receive  each  new  message  doing  so  would  be  wasteful  since  every 
received  message  would  require  completely  recomputing  all  out-bound  messages.  Therefore  prioritizing 
messages  actually  corresponds  to  prioritizing  vertices.  Using  Eq.  (3.13)  corresponds  to  using  the  vertex 
priority: 


n  =  max 


(new)  (old) 

mUj  ~  mUj 


(3.14) 


which  prioritizes  vertices  based  on  the  new  message  with  largest  change.  Intuitively,  vertex  residuals 
capture  the  amount  of  new  information  available  to  a  vertex.  Recomputing  outbound  messages  from  a 
vertex  with  unchanged  inbound  messages  results  in  a  wasted  update.  Once  the  SendMessages  operation 
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Algorithm  3.6:  Parallel  Wildfire  Algorithm  Using  Atomics 


a  <—  Arbitrary  permutation  on  {1 .... .  |U|  } 

Atomic  integer  i  <—  1 
do  in  parallel 

while  Not  Converged  do 

//  Reads  the  value  of  i  and  increments  i  atomically 
j  =  AtomicFetchAndIncrement(i) 

//  Get  the  vertex  to  update 
v  =  o(j  mod  |U|) 
if  New  Messages  into  v  then 

Grab  neighborhood  locks  in  canonical  order 
forall  j  <E  A fv  in  the  neighbors  of  v  do  in  parallel 

id)  1 

J 


Compute  Message  using 


f  (°i< 

\mUv 


i£.Nv 


if 


(new)  (old) 

mv-+j  ~  mWj 


Send  message  m, 
Release  locks 


>  / 3  then 

(new) 

v-*j 


Algorithm  3.7 :  Residual  Belief  Propagation  Algorithm 
Priority  Queue  Q 

Initialize  Q  with  all  vertices  at  oo  priority 

do  in  parallel 

while  TopResid(<5)  >  (3  do 

n  =  Top(Q)//  Get  vertex  with  highest  residual 
SendMessages ( v ) 

Set  rv  =  0  and  update  neighbors  r\rv 


is  applied  to  a  vertex,  its  residual  is  set  to  zero  and  its  neighbors  residuals  are  updated.  Vertex  scheduling 
has  the  advantage  over  message  residuals  of  using  all  of  the  most  recent  information  when  updating  a 
message. 

The  parallel  Residual  BP  algorithm  in  Alg.  3.7  uses  a  priority  queue  to  store  the  vertices  in  residual  order. 
Each  processor  executes  SendMessages  on  the  next  highest  priority  vertex  from  Q  and  then  updates  the 
priorities  on  all  neighboring  vertices. 

When  the  message  calculations  are  fairly  simple,  the  priority  queue  becomes  the  central  synchronizing 
bottleneck.  An  efficient  implementation  of  a  parallel  priority  queue  is  therefore  the  key  to  performance 
and  scalability.  There  are  numerous  parallel  priority  queue  algorithms  in  the  literature  [Crupi  et  ah,  1996, 
Driscoll  et  al.,  1988,  Parberry,  1995,  Sanders,  1998].  Many  require  sophisticated  fine  grained  locking 
mechanisms  while  others  employ  binning  strategies  with  constraints  on  the  range  and  distribution  of 
the  priorities.  Because  the  residual  priorities  are  a  heuristic,  we  find  that  relaxing  the  strict  ordering 
requirement  can  further  improve  performance  by  reducing  priority  queue  contention.  In  our  implementation 
we  randomly  assigned  vertices  to  priority  queues  associated  with  each  processor.  Each  processor  then 
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draws  from  its  own  queue  but  can  update  the  priorities  of  vertices  owned  by  other  processors. 


3.6.1  Limitations  of  Message  Residuals 


The  message  centric  residual  defined  in  Eq.  (3.14)  may  lead  to  nonuniform  convergence  in  the  pos¬ 
terior  marginals  (beliefs)  and  starvation  of  low-degree  vertices.  In  this  section  we  characterize  these 
limitations  and  define  a  new  residual  measure  Eq.  (3.17)  that  directly  captures  changes  in  the  estimated 
marginals. 

Because  the  residual  definition  in  Eq.  (3.14)  only  considers  the  largest  message  change  for  each  vertex, 
high-degree  vertices  with  many  small  message  changes  may  satisfy  the  termination  condition  while  being 
far  from  convergence. 

Lemma  3.6.1.  A  small  e  changes  in  all  messages  to  a  vertex  of  degree  d  can  align  resulting  in  a  large  de 
change  in  the  belief. 


Proof  Consider  a  discrete  binary  random  variable  X{  with  d  =  \Aft  |  and  incoming  messages  {m\ , . . . ,  mr/ } 
which  are  all  uniform.  Then  the  belief  at  that  variable  is  also  uniform  (i.e.,  h,  =  [|,  {).  Suppose  we  then 
perturb  each  message  by  e  such  that  m'k{ 0)  =  1  —  e  and  m'k ( 1 )  =  |  +  e.  The  perturbed  messages  have 
changed  by  less  than  2e  (i.e.,  Vk  :  \\m'k  —  rrik\\1  <  2e)  and  by  the  classic  convergnec  criterion  Eq.  (3.3)  all 
messages  have  effectively  converged.  However  the  new  belief  is: 


m 


*) 


cT 


The  L  i  change  of  the  belief  due  to  the  compounded  e  change  in  each  message  is  then: 


6j(0)-6i(0)||1 


(l  +  e)d+a-e) 


d' 


By  constructing  a  2nd  order  Taylor  expansion  around  e  =  0we  obtain: 


b'M-b^W^de  +  Oie3). 


Therefore,  the  change  in  belief  varies  linearly  in  the  degree  of  the  vertex  enabling  small  e  message  residuals 
to  translate  into  large  de  belief  residuals.  □ 


Alternatively,  a  large  change  in  a  single  message  may  result  in  a  small  change  in  belief  at  a  high  degree 
vertex  resulting  in  the  over-scheduling. 

Lemma  3.6.2.  A  large  1  —  e  residual  in  a  single  message  may  result  in  a  small  e  change  in  belief  at  a  high 
degree  vertex. 


Proof.  Consider  the  set  {mi, . . . ,  mY; }  of  binary  messages  with  value  [1 
that  variable  would  be 


MO) 


(1-M 

(1  -e)d  +  ed' 


e,  e]  then  the  resulting  belief  at 
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If  we  then  change  m\  to  [e,  1  —  e]  then  the  resulting  belief  is 


m 


(1  —  e)d~1e 

(1  —  e)d~1e  +  ed~l(l  —  e) 


Assuming  that  0  <  e  <  \  and  d  >  3, 


1 

2 


(l-e)d _ (1  -  e)d~1e 

(1  —  e)d  +  ed  (1  —  e)rf_1e  +  ed_1(l  —  e) 


(1  —  e)d  1e  +  ed  X(1  —  e) 
(1  -  e)d~ 2 
(1  -  e)d~2  +  ed~2 


(1  —  e)d  2  +  ed  2 


We  can  bound  the  denominator  to  obtain: 


1 

2 


cd-  2 


(1/2) 


d— 3 


(2e)d-3e  < 


e 

2d-3 


In  Figure  3.6a  we  plot  the  Li  error  in  the  belief  1 6'  —  b,  1 1 , ,  varying  the  value  of  e.  In  Figure  3.6b,  we  plot 
the  L\  error  in  the  beliefs  against  the  L\  change  in  m i .  The  curves  look  nearly  identical  (but  flipped)  since 
the  L\  change  is  exactly  2  —  4e  (recall  that  mi  was  changed  from  [1  —  e,  e]  to  [e,  1  —  e]). 


(a)  Belief  Error  against  t  (b)  Belief  En'or  against  Message  Error 


Figure  3.6:  (a):  Given  a  vertex  of  degree  d  with  all  incoming  messages  equal  to  [1  —  e,  e].  This  graph  plots 
the  LI  change  of  belief  on  a  vertex  of  degree  d  caused  by  changing  one  message  to  [e,  1  —  e] .  (b) :  Similar 
to  (a),  but  plots  on  the  X-axis  the  LI  change  of  the  message,  which  corresponds  to  exactly  2  —  4e. 


□ 


3.6.2  Belief  Residuals 

The  goal  of  belief  propagation  is  to  estimate  the  belief  for  each  vertex.  However,  message  scheduling 
and  convergence  assessment  uses  the  change  in  messages  rather  than  beliefs.  Here  we  define  a  belief 
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residual  which  addresses  the  problems  associated  with  the  message-centric  approach  and  enables  improved 
scheduling  and  termination  assessment. 

A  natural  definition  of  the  belief  residuals  analogous  to  the  message  residuals  defined  in  Eq.  (3.13)  is 

rj  =  \\br-bf%  (3.15) 

where  6°ld  is  the  belief  at  vertex  i  the  last  time  vertex  i  was  updated.  Unfortunately,  Eq.  (3.15)  has  a 
surprising  flaw  that  admits  premature  convergence  on  acyclic  graphical  models  even  when  the  termination 
condition: 

max  I  |6"ew  —  6°ld|  L  <  (3  (3.16) 

iev  "  * 

is  set  to  (3  =  0. 

Example  3.6.3  (Premature  Termination  Using  Eq.  (3.16)).  To  see  how  Eq.  (3.15)  can  lead  to  premature 
termination  consider  a  chain  MRF  with  five  vertices  and  the  binary  factors  fxz.xi+]  ( Xj ,  Xj+i)  =  I[xt  = 
Xi- |_i]  and  unary  factors: 

fx i  =  [§>9]  fx2  =  [&>  Ifi]  fx 3  =  [§>§] 
fx 4  —  [l0  ’  To]  fx5  =  [9,  g] 

We  begin  by  initializing  all  vertex  residuals  to  infinity,  and  all  messages  to  uniform  distributions  and  then 
perform  the  following  update  sequence  marked  in  black:  After  stage  (b),  772,2-5.3  =  fx2  and  7774-5.3  = 


— ^ —  — © 


f X\ •  Since  bx3  oc  (772,2-5.3  x  7774-5.3)  oc  [1,1].  the  belief  at  X3  X3  will  have  uniform  belief.  Since 
X3  is  just  updated,  it  will  have  a  residual  of  0.  After  stage  (d),  in.2-,3  oc  (fx,  x  fx2)  oc  fx,  and 
7774-5.3  oc  (fx5  x  fx., )  oc  fx2-  These  two  messages  therefore  have  swapped  values  since  stage  (b).  Since 
bx  3  oc  (7772-5.3  x  7774-5.3),  the  belief  at  X3  will  not  be  changed  and  will  therefore  continue  to  have  uniform 
belief  and  zero  residual.  At  this  point  X2  and  X4  also  have  zero  residual  since  they  were  just  updated.  After- 
stage  (e),  the  residuals  at  X\  and  Xr,  at  set  to  0.  However,  the  residuals  on  X2  and  X4  remain  zero  since 
messages  777,1—5.2  and  7775—5.4  will  not  change  since  state  (c).  All  variables  therefore  now  have  a  residual  ofO. 
By  Eq.  (3.15)  with  f3  =  0  we  have  converged  prematurely  since  no  sequence  of  messages  connects  X  \  and 
X~,.  The  use  of  the  naive  belief  residual  in  Eq.  (3.15)  will  therefore  converge  to  an  erroneous  solution. 


An  alternative  formulation  of  the  belief  residual  which  does  not  suffer  from  premature  termination  and 
offers  additional  computational  advantages  is  given  by: 


„(*) 


(t-i)  , 
U  + 


b?  ~  btl) 


bf\xi)  oc 


bf  l) (x^mfXjixi) 


d-1)/  \ 


(3.17) 

(3.18) 
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bf  1  j  is  the  belief  after  incorporating  the  previous  message  and  tr  ]  is  the  belief  after  incorporating  the 
new  message.  Intuitively,  the  belief  residual  defined  in  Eq.  (3.17)  measures  the  cumulative  effect  of  all 
message  updates  on  the  belief.  As  each  new  message  arrives,  the  belief  can  be  efficiently  recomputed  using 
Eq.  (3.18).  Since  with  each  message  change  we  can  quickly  update  the  local  belief  using  Eq.  (3.18)  and 
corresponding  belief  residual  (Eq.  (3.17)),  we  do  not  need  to  retain  the  old  messages  and  beliefs  reducing 
the  storage  requirements  and  the  overhead  associated  with  scheduling. 

The  belief  residual  may  also  be  used  as  the  convergence  condition: 

max  r'i  <  (3  (3.19) 

i£V 

terminating  when  all  vertices  have  a  belief  residual  below  a  fixed  threshold  f3  >  0.  Since  Eq.  (3.17)  satisfies 
the  triangle  inequality,  it  is  an  upper  bound  on  the  total  change  in  belief  ensuring  that  the  algorithm  does 
not  change  while  their  are  beliefs  that  have  changed  by  more  than  B.  Because  Eq.  (3.17)  accumulates  the 
change  in  belief  with  each  new  message,  it  will  not  lead  to  premature  termination  scenario  encountered  in 
the  more  naive  belief  residual  definition  (Eq.  (3.15)). 

Since  the  belief  residual  of  a  vertex  can  only  increase  until  the  vertex  is  updated  and  all  vertices  with  belief 
residuals  above  the  termination  threshold  are  eventually  updated,  the  belief  residual  prevents  starvation.  In 
addition,  belief  residuals  also  address  the  issue  of  over- scheduling  in  high  degree  vertices  since  a  large 
change  in  a  single  inbound  message  that  does  not  contribute  to  a  significant  change  in  belief  will  not 
significantly  change  the  belief  residual. 

We  use  the  belief  residual  for  termination  assessment  and  as  a  priority  for  Wildfire  and  Residual  BR 


3.7  Splash  Belief  Propagation 

The  Round  Robin  BP,  WildFire  BP,  and  Residual  BP  algorithms  improve  upon  the  Synchronous  BP  algo¬ 
rithm.  The  Round  Robin  BP  algorithm  introduces  asynchronous  execution  allowing  each  message  update 
to  access  the  most  recent  version  of  its  dependent  messages  and  eliminating  the  need  for  global  barriers. 
The  WildFire  BP  algorithm  improves  upon  the  Round  Robin  algorithm  by  eliminating  wasted  computation 
as  messages  converge  at  different  rates.  Finally  the  Residual  BP  algorithm  focuses  computation  resources 
where  they  can  make  the  most  progress  further  accelerating  convergence.  Unfortunately,  despite  all  of 
these  improvements,  all  the  algorithms  we  have  described  so  far  share  the  same  asymptotic  0(nre/p  +  re) 
performance  on  chain  graphical  models.  The  problem  is  none  of  these  algorithms  directly  address  the 
forward-backward  sequential  structure  of  inference. 


3.7.1  Optimal  Parallel  Scheduling  on  Chains 

We  begin  by  developing  a  simple  algorithm  (Alg.  3.8)  that  composes  the  forward-backward  schedule 
to  efficiently  construct  a  rf -approximation  in  parallel  and  achieve  the  asymptotic  lower  bound  given  in 
Theorem  3.3.2.  Alg.  3.8  begins  by  dividing  the  chain  evenly  among  the p  processors  as  shown  in  Figure  3.7a. 
Then,  in  parallel,  each  processor  runs  the  sequential  forward-backward  algorithm  on  its  sub-chain  as  shown 
in  Figure  3.7b.  Finally  each  processor  exchanges  messages  along  the  boundaries  and  repeats  the  procedure 
until  convergence.  Notice,  the  algorithm  does  not  require  knowledge  of  re  and  instead  relies  on  the 
convergence  criterion  to  ensure  an  accurate  approximation. 
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(c)  Message  Exchange 

Figure  3.7 :  An  illustration  of  the  optimal  scheduling  on  a  chain,  (a)  The  chain  is  partitioned  evenly  among  the  processors,  (b) 
In  parallel  each  processor  runs  the  forward-backward  sequential  scheduling.  The  forward  sweep  updates  messages  towards  the 
center  vertex  and  the  backwards  sweep  updates  messages  towards  the  boundary.  This  slightly  abnormal  forward-backward  sweep 
will  be  more  natural  when  we  generalize  it  to  arbitrary  cyclic  graphs  in  the  Parallel  Splash  Algorithm,  (c)  Messages  that  cross  the 
processor  boundaries  are  exchanged  and  steps  (b)  and  (c)  are  repeated  until  convergence. 


We  now  show  that  this  simple  procedure  achieves  the  asymptotic  lower  bound: 

Theorem  3.7.1  (Optimality  of  Alg.  3.8).  Given  a  chain  graph  with  n  vertices  and  p  <  n  processors,  the 
ChainSplash  belief  propagation  algorithm,  achieves  a  re  level  approximation  for  all  vertices  in  time 


O 


Proof  As  with  the  lower  bound  proof  we  will  consider  messages  in  only  one  direction  from  x,  to  xr+ 1 . 
After  each  local  execution  of  the  forward-backward  algorithm  all  vertices  become  left-aware  of  all  vertices 
within  each  processor  block.  After  messages  are  exchanged  the  left  most  vertex  becomes  left-aware  of 
the  right  most  vertex  on  the  preceding  processor.  By  transitivity  of  left-awareness,  after  k  iterations,  all 
vertices  become  left-aware  of  (k  —  l)n/p  vertices.  We  require  all  vertices  to  be  made  re  left-aware  and  so 
solving  for  /c-iterations  we  obtain: 
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Each  iteration  is  executed  in  parallel  in  time  n/p  so  the  total  running  time  is  then: 
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Algorithm  3.8:  ChainSplash  BP  Algorithm 
Input:  Chain  graphical  model  ( V. ,  E)  with  n  vertices 
//  Partition  the  chain  over  the  p  processors 

forall  *  €  {1, ...  ,p}  do  in  parallel 

j_  i  TO- l)n/pl ,  .  .  .  ,  X  [in/p]  —  1 } 

while  Not  Converged  do 

//  Run  Forward-Backward  Belief  Propagation  on  each  Block 

forall  i  £  { 1 , . . . ,  p}  do  in  parallel 

|  Run  Sequential  Forward-Backward  on 

//  Exchange  Messages 

forall  i  €  { 1 , . . . ,  p}  do  in  parallel 

|  Exchange  messages  with  /i,_  i  and  B,+\ 


Our  new  Splash  BP  algorithm  generalizes  Alg.  3.8  to  arbitrary  cyclic  graphical  models.  The  Splash 
algorithm  is  composed  of  two  core  components,  the  Splash  operation  which  generalizes  the  forward- 
backward  schedule  to  arbitrary  cyclic  graphical  models,  and  a  dynamic  Splash  scheduling  heuristic  which 
ultimately  determines  the  shape,  size,  and  location  of  the  Splash  operation.  We  will  first  present  the  Parallel 
Splash  algorithm  as  a  sequential  algorithm,  introducing  the  Splash  operation,  belief  scheduling,  and  how 
they  are  combined  to  produce  a  simple  single  processor  Splash  algorithm.  Then  in  subsequent  sections 
we  describe  the  parallel  structure  of  this  algorithm  and  how  it  can  efficiently  be  mapped  to  shared  and 
distributed  memory  systems. 


3.7.2  The  Splash  Operation 

The  Splash  BP  algorithm  is  built  around  the  Splash3  operation  (Alg.  3.9  and  Figure  3.8)  which  generalizes 
the  forward-backward  schedule  illustrated  in  Figure  3.2a  by  constructing  a  small  sub-tree  and  then  executing 
a  local  forward-backward  schedule  on  the  sub-tree.  By  scheduling  message  calculations  along  a  local  tree, 
we  directly  address  the  sequential  message  dependencies  and  retain  the  optimal  forward-backward  schedule 
when  applied  to  acyclic  models. 

The  inputs  to  the  Splash  operation  arc  the  root  vertex  v  and  the  maximum  allowed  size  of  the  Splash,  W. 
The  Splash  operation  begins  by  constructing  a  sub-tree  rooted  at  v,  adding  vertices  in  breadth  first  order 
until  the  sub-tree  exceeds  W  which  is  measured  in  terms  of  number  of  floating  point  operations  associated 
with  executing  BP  inference  on  the  sub-tree.  The  size  of  each  sub-tree  is  restricted  to  ensure  work  balance 
across  processors,  which  will  execute  multiple  Splash  operations  in  parallel.  The  work  associated  with 
each  vertex  u  (which  could  be  a  variable  or  factor)  depends  on  the  size  of  the  messages  and  the  number  of 
neighbors  in  the  graph: 

wu  =  \K\  x  \%u\  +  ''l  ]  \%v\  ■>  (3.20) 

xveNu 

where  \J\fu\  x  \XU\  represents  the  work  required  to  recompute  all  outbound  messages  and  Ylx  eN  I ^-v I  is 
the  work  required  to  update  the  beliefs  of  all  the  neighboring  vertices  (needed  for  dynamic  scheduling).  In 

3The  algorithm  is  called  Splash  BP  because  the  forward-backward  “wavefront”  seen  when  running  the  algorithm  on  large 
planar  graphs  looks  like  a  splash. 
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Algorithm  3.9:  Splash(n,  W) 

Input:  vertex  v 

Input:  maximum  Splash  size  W 

//  Construct  the  breadth  first  search  ordering  with  W  message 
computations  and  rooted  at  v. 
fifo  •(—[]//  FiFo  Spanning  Tree  Queue 
a  <r-  ( v )  //  Initial  Splash  ordering  is  the  root  v 
AccumW  4—  wv  / /  Accumulate  the  root  vertex  work 
visited  <—  {n} //  Set  of  visited  vertices 
fifo. Enqueue  (A fv) 
while  fifo  is  not  empty  do 
u  4—  fifo.Dequeue  ( ) 

//  If  adding  u  does  not  cause  Splash  to  exceed  limit 
if  AccumW  +  wu  <  W  then 
AccumW  4—  AccumW  +  wu 
Add  u  to  the  end  of  the  ordering  a 
foreach  neighbors  v  £  Mu  do 
if  v  is  not  in  visited  then 

fifo. Enqueue  (v)  / /  Add  to  boundary  of  spanning  tree 
visited  4—  visited  U  {n}  / /  Mark  Visited 

//  Forward  Pass:  sends  messages  from  the  leaves  to  root 
t  foreach  u  £  ReverseOrder  (a)  do 

|  SendMessages (u)  //  Update  priorities  if  necessary. 

II  Backward  Pass:  sends  messages  from  the  root  to  leaves 

2  foreach  u  £  rr  do 

j  SendMessages (it)  //  Update  priorities  if  necessary. 


Table  3.1  we  compute  the  work  associated  with  each  vertex  shown  in  Figure  3.8a. 

The  local  spanning  tree  rooted  at  vertex  F  in  Figure  3.8  is  depicted  by  the  shaded  rectangle  which  grows 
outwards  in  the  sequence  of  figures  Figure  3.8b  through  Figure  3.8e.  The  maximum  splash  size  is  set  to 
W  =  170.  In  Figure  3.8b  the  Splash  contains  only  vertex  F  (the  root)  and  has  total  accumulated  work  of 
w  =  30.  The  vertices  A,  E,  and  F  are  added  in  Figure  3.8c  expanding  the  breadth  first  search  without 
exceeding  W  =  170.  The  effect  of  W  =  170  is  seen  in  Figure  3.8d  vertices  B  and  K  are  added  but  vertex 
G  is  skipped  because  including  G  would  cause  the  Splash  to  exceed  W  =  170.  The  maximum  splash  size 
is  achieve  in  Figure  3.8e  when  vertex  C  is  added  and  no  other  vertices  may  be  added  without  exceeding 
W  =  170.  The  final  splash  ordering  is  a  =  \F.  E,  A,  J,  B,  K,  C], 

Using  the  reverse  of  the  breadth  first  search  ordering  constructed  in  the  first  phase,  the  SendMessages 
operation  is  sequentially  invoked  on  each  vertex  in  the  spanning  tree  (Line  1)  stalling  at  the  leaves,  in 
analogy  to  the  forward  sweep.  The  function  SendMessages  ( v )  updates  all  outbound  messages  from 
vertex  v  using  the  most  recent  inbound  messages.  In  Figure  3.8f,  SendMessages  ( C )  is  invoked  on 
vertex  C  causing  the  messages  mc->B,  and  n  to  be  recomputed.  Finally,  messages  are 

computed  in  the  original  a  ordering  stalling  at  the  root  and  invoking  SendMessages  sequentially  on 
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Figure  3.8:  A  splash  of  splash  size  W  =  170  is  grown  starting  at  vertex  F.  The  Splash  spanning  tree 
is  represented  by  the  shaded  region,  (a)  The  initial  factor  graph  is  labeled  with  the  vertex  work  (wi) 
associated  with  each  vertex,  (b)  The  Splash  begins  rooted  at  vertex  F  with  accumulated  work  w  =  30. 
(c)  The  neighbors  of  F  are  added  to  the  Splash  and  the  accumulated  work  increases  to  w  =  108.  (d) 
The  Splash  expands  further  to  include  vertex  B  and  K  but  does  not  include  vertex  G  because  doing  so 
would  exceed  the  maximum  splash  size  of  W  =  170.  (e)  The  splash  expand  once  more  to  include  vertex 
C  but  can  no  longer  expand  without  exceeding  the  maximum  splash  size.  The  final  Splash  ordering  is 
(7  =  [F.  E,  A ,  J,  B,  K.  C] .  (f)  The  SendMessages  operation  is  invoked  on  vertex  C  causing  the  messages 
mc^B,  and  mc-+D  to  be  recomputed. 


each  vertex,  completing  the  backwards  sweep.  Hence,  with  the  exception  of  the  root  vertex  v,  all  messages 
originating  from  each  vertex  in  a  are  computed  twice,  once  on  the  forwards  sweep  in  Line  1  and  once  in 
the  backwards  sweep  in  Line  2. 

It  is  easy  to  see  that  by  repeatedly  executing  p  parallel  Splashes  (see  Figure  3.7)  of  size  W  =  wn/p 
(where  w  is  the  work  of  updating  a  single  vertex)  placed  evenly  along  the  chain  we  can  recover  the  optimal 
algorithm  Alg.  3.8. 


3.7.3  The  Sequential  SplashBP  Algorithm 

By  combining  the  Splash  operation  with  the  belief  residual  scheduling  to  select  Splash  roots  we  obtain 
the  sequential  version  of  the  SplashBP  algorithm  given  in  Alg.  3.10.  The  belief  residual  of  each  vertex 
is  initialized  to  the  largest  possible  value  (e.g.,  Inf)  to  ensure  that  every  vertex  is  updated  at  least  once. 
The  Splash  operation  is  repeatedly  applied  to  the  vertex  with  highest  priority.  During  the  Splash  operation 
the  priorities  (belief  residuals)  of  all  vertices  in  the  Splash  and  along  the  boundary  are  updated:  after 
SendMessages  is  invoked  on  a  vertex  its  belief  residual  is  set  to  zero  and  the  priority  of  its  neighboring 
vertices  is  increased  by  the  change  in  their  beliefs. 
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Vertex 

Assignments  (A,;) 

Degree  (|A/"i|) 

Neighbor  Costs  (Ewe.V  l-Mol) 

Work  (wi) 

A 

2 

3 

23  +  24  +  22 

34 

B 

22 

2 

2  +  2 

12 

C 

2 

3 

22  +  24  +  22 

30 

D 

22 

2 

2  +  2 

12 

E 

2 

1 

23 

10 

F 

23 

3 

2  +  2  +  2 

30 

G 

24 

4 

2  +  2  +  2  +  2 

72 

H 

2 

2 

22  +  22 

12 

Table  3.1:  The  work  associated  with  each  vertex  in  Figure  3.8a  is  computed  using  Eq.  (3.20).  We  have 
omitted  vertices  I,  J,  K,  and  L  sine  they  arc  equivalent  to  vertices  D,  A,  B,  and  C  respectively. 


Algorithm  3.10:  The  Sequential  Splash  Algorithm 
Input:  Constants:  maximum  Splash  size  W,  termination  bound  3 

Q  <—  InitializeQueue ( Q ) 

Set  All  Residuals  to  oo 

while  TopResid(<5)  >  3  do 
v  Top  ( Q ) 

Splash  (Q,  v,  W)  //  Priorities  updated  after  SendMessages 


3.7.4  Evaluating  SplashBP  in  the  Sequential  Setting 

To  demonstrate  the  performance  gains  of  the  sequential  SplashBP  algorithm  on  chain  structured  models  we 
constructed  a  set  of  synthetic  chain  graphical  models  and  evaluated  the  running  time  on  these  models  for  a 
fixed  convergence  criterion  while  scaling  the  size  of  the  Splash  in  Figure  3.9a  and  while  scaling  the  size  of 
the  chain  in  Figure  3.9b.  Each  chain  consists  of  binary  random  variables  with  weak  random  node  potentials 
and  strong  attractive  edge  potentials.  As  the  size  of  the  Splash  increases  (Figure  3.9a)  the  total  number  of 
updates  on  the  chain  decreases  reflecting  the  optimality  of  the  underlying  forward-backward  structure  of 
the  Splash. 

In  Figure  3.9b  we  compare  the  running  time  of  Splash  with  Synchronous,  Round-Robin,  Residual,  and 
Wildfire  BP  algorithms  as  we  increase  the  size  of  the  model.  The  conventional  Synchronous  and  Round- 
Robin  algorithms  arc  an  order  of  magnitude  slower  than  Wildfire,  ResiduaBP,  and  Splash  and  scale  poorly 
forcing  separate  comparisons  for  readability.  Nonetheless,  in  all  cases  the  Splash  algorithm  is  substantially 
faster  and  demonstrates  better  scaling  with  increasing  model  size. 

The  running  time,  computational  efficiency,  and  accuracy  of  the  SplashBP  algorithm  were  evaluated  in 
the  single  processor  setting.  In  Figure  3.10a,  Figure  3.10b,  and  Figure  3.10c  we  plot  the  average  belief 
accuracy,  worst  case  belief  accuracy,  and  prediction  accuracy  against  the  runtime  on  a  subset  of  the  UAI 
2008  Probabilistic  Inference  Evaluation  dataset  [Darwiche  et  ah,  2008].  We  ran  each  algorithm  to  3  =  10~5 
convergence  and  report  the  runtime  in  seconds  and  the  marginal  estimates  for  all  variables.  We  compared 
against  the  exact  marginals  obtained  using  Ace  2.0  [Fluang  et  ah,  2006].  In  all  cases  the  Splash  algorithm 
obtained  the  most  accurate  belief  estimates  in  the  shortest  time.  The  dynamical  scheduled  algorithms, 
Wildfire  and  Residual  BP,  consistently  outperformed  the  statically  scheduled  algorithms,  Round-Robin  and 
Synchronous  BP.  We  also  assessed  accuracy  on  a  protein  side  chain  prediction  task  [Yanover  et  ah,  2007] 
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(a)  Runtime  versus  Splash  Size 
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(b)  Runtime  versus  Chain  Size 


Figure  3.9:  In  this  figure,  we  plot  the  number  of  vertex  updates  on  a  randomly  generated  chain  graphical 
model.  Run-time  to  convergence  is  measured  in  vertex  updates  rather  than  wall  clock  time  to  ensure  a  fair 
algorithmic  comparison  and  eliminate  hardware  and  implementation  effects  which  appear  at  the  extremely 
short  run-times  encountered  on  these  simple  models,  (a)  The  number  of  vertex  updates  made  by  Sequential 
Splash  BP,  fixing  the  chain  length  to  1000  variables,  and  varying  the  Splash  size,  (b)  The  number  of 
vertex  updates  made  by  various  BP  algorithms  varying  the  length  of  the  chain.  Two  plots  are  used  to 
improve  readability  since  the  Splash  algorithm  is  an  order  of  magnitude  faster  than  the  Synchronous  and 
Round-Robin  algorithms  however  the  Splash  algorithm  curve  is  the  same  for  both  plots. 


where  the  goal  is  to  estimate  the  orientations  of  each  side  chain  in  a  protein  structure.  Here  we  find  that  all 
belief  propagation  algorithms  achieve  roughly  the  same  prediction  accuracy  of  73%  for  xi  and  X'2  angles 
which  is  consistent  with  the  results  of  Yanover  et  al.  [2007]. 

Not  only  does  SplashBP  converge  faster  than  other  belief  propagation  algorithms,  it  also  converges 
more  often.  We  assessed  the  convergence  of  the  SplashBP  algorithm  using  several  different  metrics. 
In  Figure  3.11a  we  plot  the  number  of  protein  networks  that  have  converged  ( (3  =  10~5)  against  the 
run-time.  In  Figure  3.11b  we  plot  the  number  of  protein  networks  that  have  converged  against  the  number 
of  message  computations.  Again,  we  see  that  SplashBP  converges  sooner  than  other  belief  propagation 
algorithms. 

In  Figure  3.12a  and  Figure  3.12b  we  directly  compare  the  running  time  and  work  of  the  SplashBP  algorithm 
against  other  common  scheduling  strategies  on  a  set  of  challenging  protein-protein  interaction  networks 
obtained  from  Taskar  et  al.  [2004b],  The  results  are  presented  for  each  of  the  8  networks  separately  with 
bars  in  the  order  Splash ,  Residual ,  Wildfire ,  Round-Robin,  and  Synchronous.  All  algorithms  converged 
on  all  networks  except  residual  belief  propagation  which  failed  to  converge  on  network  five.  In  all  cases 
the  Splash  algorithm  achieves  the  shortest  running  time  and  is  the  in  the  bottom  two  in  total  message 
computations.  Since  the  Splash  algorithm  favors  faster  message  computations  (lower  degree  vertices),  it 
is  possible  for  the  Splash  algorithm  is  able  to  achieve  a  shorter  runtime  even  while  doing  slightly  more 
message  computations. 


3.7.5  Parallelizing  the  Splash  Algorithm 

We  easily  parallelize  the  SplashBP  algorithm  (Alg.  3.11)  by  executing  multiple  Splash  operations  in  parallel 
generalizing  Alg.  3.8.  Each  of  the  p  processor  executes  it  own  separate  Splash  operation  on  the  top  p 
highest  priority  vertices. 
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(a)  Average  Li  Error  (b)  Maximum  L\  Error  (c)  MAP  Accuracy 


Figure  3.10:  We  assessed  the  accuracy  of  Splash  algorithm  using  the  exact  inference  challenge  networks 
from  Darwiche  et  al.  [2008]  as  well  as  the  protein  side  chain  prediction  networks  obtained  from  Yanover 
et  al.  [2007].  In  (a)  and  (b)  we  plot  the  average  and  max  L\  error  in  the  belief  estimates  for  all  variables  as 
a  function  of  the  running  time.  In  (c)  we  plot  the  prediction  accuracy  of  the  MAP  estimates  as  a  function  of 
the  running  time.  In  all  cases  we  find  that  Splash  belief  propagation  achieves  the  greatest  accuracy  in  the 
least  time. 


Algorithm  3.11:  Parallel  Splash  Belief  Propagation  Algorithm 
Input:  Constants:  maximum  Splash  size  W,  termination  bound  (3 
Q  <—  InitializeQueue ( Q ) 

Set  All  Residuals  to  oo 
t  forall  processors  do  in  parallel 
while  TopResidual(Q)  >  13  do 
v  <—  Pop(Q)  //  Atomic 

Splash  (Q,  v,  W)  //  Updates  vertex  residuals 
(5-Push((n,  Residual(n)))  / /  Atomic 


While  we  do  not  require  that  parallel  Splash  operations  contain  disjoint  sets  of  vertices,  we  do  require  that 
each  Splash  has  a  unique  root  which  is  achieved  through  a  shared  priority  queue  and  atomic  Push  and 
Pop  operations. 

To  achieve  maximum  parallel  performance  the  Parallel  Splash  algorithm  relies  on  an  efficient  par¬ 
allel  scheduling  queue  to  minimize  processor  locking  and  sequentialization  when  Push,  Pop,  and 
UpdatePriority  are  invoked.  While  there  is  considerable  work  from  the  parallel  computing  commu¬ 
nity  on  efficient  parallel  queue  data  structures,  we  find  that  in  the  shared  memory  setting  basic  locking 
queues  provide  sufficient  performance.  In  the  distributed  Splash  algorithm  discussed  in  Section  3.8  we 
employ  a  distributed  scheduling  queue  in  a  work  balanced  manner  to  eliminate  processor  contention. 


3.7.6  Parallel  Splash  Optimality  on  Chains 

We  now  show  that  the  parallel  SplashBP  algorithm  using  belief  residual  scheduling  achieves  the  opti¬ 
mal  running  time  for  chain  graphical  models.  We  begin  by  relating  the  Splash  operation  to  the  vertex 
residuals. 

Lemma  3.7.2  (Splash  Residuals).  Immediately  after  the  Splash  operation  is  applied  to  an  acyclic  graph 
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Figure  3.11:  The  Splash  algorithm  demonstrates  faster  and  more  consistent  convergence  than  other  baseline 
algorithms  on  a  single  processor.  In  the  number  of  converged  (/?  =  10-5)  networks  (out  of  276)  is  plotted 
against  the  runtime  (a)  and  number  of  message  calculations  (b). 


all  vertices  interior  to  the  Splash  have  zero  belief  residual. 

Proof.  The  proof  follows  naturally  from  the  convergence  of  BP  on  acyclic  graphs.  The  Splash  operation 
runs  BP  to  convergence  on  the  sub-tree  contained  within  the  Splash.  Therefore  the  marginal  estimates  at 
each  variable  will  have  converged  and  the  corresponding  belief  residuals  will  be  zero.  □ 

After  a  Splash  is  completed,  the  residuals  associated  with  vertices  interior  to  the  Splash  are  propagated  to 
the  exterior  vertices  along  the  boundary  of  the  Splash.  Repeated  application  of  the  Splash  operation  will 
continue  to  move  the  boundary  residual  leading  to  Lemma  3.7.3. 

Lemma  3.7.3  (Basic  Convergence).  Given  a  chain  graph  where  only  one  vertex  has  nonzero  residual,  the 
Splash  algorithm  with  Splash  size  W  will  run  in  0(re  +  W )  time. 

Proof.  When  the  first  Splash,  originating  from  the  vertex  with  nonzero  residual  is  finished,  the  interior  of 
the  Splash  will  have  zero  residual  as  stated  in  Lemma  3.7.2,  and  only  the  boundary  of  the  Splash  will  have 
non-zero  residual.  Because  all  other  vertices  initially  had  zero  residual  and  messages  in  opposite  directions 
do  not  interact,  each  subsequent  Splash  will  originate  from  the  boundary  of  the  region  already  covered  by 
the  previous  Splash  operations.  By  definition  the  convergence  criterion  is  achieved  after  the  high  residual 
messages  at  the  originating  vertex  propagate  a  distance  t€.  However,  because  the  Splash  size  is  fixed,  the 
Splash  operation  may  propagate  messages  an  additional  W  vertices.  □ 

If  we  set  the  initial  residuals  to  ensure  that  the  first  p  parallel  Splashes  are  uniformly  spaced,  SplashBP 
obtains  the  optimal  lower  bound. 

Theorem  3.7.4  (SplashPB  Optimality  on  Chain  Models).  Given  a  chain  graph  with  n  vertices  and  p  <  n 
processors  we  can  apply  the  Splash  algorithm  with  the  Splash  size  set  toW  =  nip  and  uniformly  spaced 
initial  Splashes  to  obtain  a  tv  -approximation  in  expected  running  time 
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x  10' 


(a)  Runtime  (b)  Work 

Figure  3.12:  We  assessed  runtime  and  convergence  on  eight  different  protein-protein  interaction  networks 
obtained  from  Taskar  et  al.  [2004b].  In  (a)  and  (b)  we  plot  the  runtime  in  seconds  and  the  work  in  message 
computations  for  each  algorithm  on  each  network.  Each  bar  represents  a  different  algorithm  in  the  order 
Splash ,  Residual ,  Wildfire,  Round-Robin,  and  Synchronous  from  left  (darkest)  to  right  (lightest). 


Proof  of  Theorem  3.7.4.  We  set  every  n/p  vertex  {Xn/2p,  X3n/2p,  X5n/2p, . . .}  to  have  slightly  higher 
residual  than  all  other  vertices  forcing  the  first  p  Splash  operations  to  staid  on  these  vertices.  Since  the  size 
of  each  splash  is  also  W  =  n/p,  all  vertices  will  be  visited  in  the  first  p  splashes.  Specifically,  we  note  that 
at  each  Splash  only  produces  two  vertices  of  non-zero  residual  (see  Lemma  3.7.2).  Therefore  there  arc  at 
most  O  ip)  vertices  of  non- zero  residual  left  after  the  first  p  Splashes. 

To  obtain  an  upper  bound,  we  consider  the  runtime  obtained  if  we  independently  compute  each  re  subtree 
rooted  at  a  vertex  of  non-zero  residual.  This  is  an  upper  bound  as  we  observe  that  if  a  single  Splash  overlaps 
more  than  one  vertex  of  non-zero  residual,  progress  is  made  simultaneously  on  more  than  one  subtree  and 
the  running  time  can  only  be  decreased. 

From  Lemma  3.7.2,  we  see  that  the  total  number  of  updates  needed  including  the  initial  O  (p)  Splash 
operations  is  0(p(re  +  W ) )  +  0(n )  =  0(n  +pre).  Since  work  is  evenly  distributed,  each  processor 
performs  0(n/p  +  re)  updates.  □ 

In  practice,  when  the  graph  structure  is  not  a  simple  chain  graph,  it  may  be  difficult  to  evenly  space  Splash 
operations.  By  randomly  placing  the  initial  Splash  operations  we  can  obtain  a  factor  log(p)  approximation 
in  expectation. 

Corollary  3.7.5  (SplashBP  with  Random  Initialization).  If  all  residuals  are  initialized  to  a  random  value 
greater  than  the  maximum  residual,  the  total  expected  running  time  is  at  most  0(\og{p)(n/p  +  re)). 

Proof.  Partition  the  chain  graph  into  p  blocks  of  size  n/p.  If  a  Splash  originates  in  a  block  then  it  will 
update  all  vertices  interior  to  the  block.  The  expected  time  to  Splash  (collect)  all  p  blocks  is  upper  bounded4 
by  the  coupon  collectors  problem.  Therefore,  at  most  O  (plog(p))  Splash  operations  (rather  than  the  p 
Splash  operations  used  in  Theorem  3.7.4)  are  required  in  expectation  to  update  each  vertex  at  least  once. 
Using  the  same  method  as  in  Theorem  3.7.4,  we  observe  that  the  running  time  is  O  (log(p)  ( n/p  +  re)).  □ 

4Since  candidate  vertices  are  removed  between  parallel  rounds,  there  should  be  much  fewer  collection  steps  than  the  analogous 
coupon  collectors  problem. 
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Algorithm  3.12:  DynamicSplashfQ,  v,  W ) 


Input:  scheduling  queue  Q,  vertex  v,  maximum  splash  size  W 
fifo  A-  []  //  FiFo  Spanning  Tree  Queue 
a A-  (v)  //  Initial  Splash  ordering  is  the  root  v 
AccurnW  A-  wv  //  Accumulate  the  root  vertex  work 
visited  A-  {n} //  Set  of  visited  vertices 
fifo. Enqueue  (A fv) 
while  fifo  is  not  empty  do 
u  A-  fifo.Dequeue  ( ) 
ii  AccurnW  +  wu  <  W  then 
AccurnW  A-  AccurnW  +  wu 
Add  u  to  the  end  of  the  ordering  a 
foreach  neighbors  v  £  Mu  do 

if  Belief  Residual  ofv  is  greater  than  /3  and  is  not  in  visited  then 
fifo. Enqueue  {v)  / /  Add  boundary 
visited  A-  visited  U  {n}  / /  Mark  Visited 


L 


foreach  u  £  ReverseOrder  (a)  do 
SendMessages (u) 

Q.SetPriority  (u,  0)  //  Set  Belief  residual  to  zero 
foreach  v  £  A fu  do  Q.UpdatePriority  (v) 

foreach  a  £  a  do 

SendMessages  (it) 

Q.SetPriority  (u,  0)  //  Set  Belief  residual  to  zero 
foreach  v  £  A fu  do  Q.UpdatePriority  (v) 


3.7.7  Dynamically  Tuning  the  Splash  Operation 

A  weakness  of  the  Splash  operation  is  that  it  requires  selecting  the  Splash  size  parameter  W  which  affects 
the  overall  performance.  If  the  Splash  size  is  too  large  then  the  algorithm  will  be  forced  to  recompute 
messages  that  have  already  converged.  Alternatively,  if  the  Splash  size  is  set  too  small  then  we  do  not  exploit 
the  local  sequential  structure.  To  address  this  weakness  introduced  Dynamic  Splashes  which  substantially 
improve  performance  in  practice  and  eliminate  the  need  to  tune  the  Splash  size  parameter. 

Dynamic  Splashes  (Alg.  3.12)  automatically  adjust  the  size  and  shape  of  each  Splash  by  adopting  the 
Wildfire  BP  heuristic  in  the  sub-tree  construction  procedure  of  the  Splash  operation.  During  the  initial 
breadth  first  search  phase  of  the  Dynamic  Splash  operation  to  excludes  vertices  with  belief  residuals  below 
the  termination  condition.  This  ensures  that  we  do  not  recompute  messages  that  have  already  converged, 
and  more  importantly  allows  the  Splash  operation  to  dynamically  adapt  to  the  local  convergence  patterns  in 
the  graph.  In  addition,  as  the  algorithm  converges,  low  belief  residuals  automatically  prune  the  breadth 
first  search  tuning  the  size  of  each  Splash  and  eliminating  the  need  to  carefully  choose  the  Splash  size 
W.  Instead  we  can  fix  Splash  size  to  a  relatively  large  fraction  of  the  graph  (e.g.,  n/p)  and  let  pruning 
automatically  decrease  the  Splash  size  as  needed. 

In  Figure  3.13a  we  plot  the  running  time  of  the  Parallel  Splash  algorithm  with  different  Splash  sizes  W 
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(a)  Automatic  Splash  Size  Tuning 


(b)  Irregular  Splash  Shape 


(c)  Execution  of  the  Splash  Algorithm  on  the  Denoising  Task 


Figure  3.13:  (a)  The  running  time  of  the  Splash  algorithm  using  various  different  Splash  sizes  with  and 
without  pruning,  (b)  The  vertices  with  high  belief  residual, shown  in  black,  are  included  in  the  Splash  while 
vertices  with  belief  residual  below  the  termination  threshold,  shown  in  gray  are  excluded,  (c)  To  illustrate 
the  execution  of  the  Splash  BP  algorithm  we  ran  it  on  a  simple  image  denoising  task  and  took  snapshots 
of  the  program  state  at  four  representative  points  (phases)  during  the  execution.  The  cumulative  vertex 
updates  (number  of  times  SendMessages  was  invoked  since  the  last  snapshot)  are  plotted  with  brighter 
regions  being  updates  more  often  than  darker  regions.  Initially,  large  regular  (rectangular)  Splashes  are 
evenly  spread  over  the  entire  model.  As  the  algorithm  proceeds  the  Splashes  become  smaller  and  more 
irregular  focusing  on  the  challenging  regions  along  the  boundaries  of  the  underlying  image. 


both  with  and  without  Splash  pruning  enabled.  With  Splash  pruning  disabled  there  is  clear  optimal  Splash 
size.  However  with  Splash  pruning  enabled,  increasing  the  size  of  the  Splash  beyond  the  optimal  size  does 
not  reduce  the  performance.  In  Figure  3.13c  we  illustrate  examples  of  the  Splashes  at  various  phases  of 
the  algorithm  on  the  image  denoising  task  (see  Figure  2.1).  Initially  the  Splashes  are  relatively  large  and 
uniform  but  as  the  algorithm  converges  the  Splashes  shrink  and  adapt  to  the  local  shape  of  the  remaining 
non-converged  regions  in  the  model. 


3.7.8  Implementing  SplashBP  in  the  Shared  Memory  Setting 

The  parallel  SplashBP  algorithm  presented  in  the  previous  section  (Alg.  3.11)  can  easily  be  mapped  to  the 
shared  memory  setting  described  in  Section  2.2.1.  Because  the  shared  memory  setting  ensures  relatively 
low  latency  access  to  all  memory  it  is  not  necessary5  to  partition  the  graph  over  the  processors.  However, 
because  multiple  processors  can  read  and  modify  the  same  message,  we  must  ensure  that  messages  and 

5  It  is  possible  that  in  an  extreme  NUMA  setting  partitioning  could  improve  performance  and  in  this  case  we  advocate  using 
ideas  from  the  distributed  implementation. 
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beliefs  are  manipulated  in  a  consistent  manner  which  is  accomplished  using  the  SendMessages  locking 
scheme  described  in  Alg.  3.5. 


Number  of  Processors 
(a)  Runtime 


(b)  Speedup 


x  10 


Number  of  Processors 
(c)  Work 


(d)  Comp.  Efficiency 


Figure  3.14:  Multicore  results  for  a  Protein-Protein  Interaction  Networks  [Taskar  et  al.,  2004a] 


In  Figure  3.14  we  compare  each  algorithm  in  terms  of  runtime,  speedup,  work,  and  efficiency  as  a  function 
of  the  number  of  cores.  All  experiments  were  run  on  protein-protein  interaction  networks  obtained  from 
Taskar  et  al.  [2004b]  with  over  14K  binary  variables,  20K  factors,  and  50K  edges.  The  runtime,  shown  in 
Figure  3.14a,  is  measured  in  seconds  of  elapsed  wall  clock  time  before  convergence.  On  all  of  the  networks 
the  SplashBP  algorithm  is  the  fastest.  The  static  scheduling  algorithms  (Round-Robin  and  Synchronous 
BP)  are  consistently  slower  than  the  dynamic  scheduling  algorithms  (Residual  and  Widlhre  BP). 

The  Splash  algorithm  demonstrates  nearly  linear  scaling.  In  Figure  3.14b  we  plot  the  strong  scaling 
of  each  algorithm  measured  relative  to  the  fastest  single  processor  algorithm.  As  a  consequence  of  the 
relative-to-best  scaling,  inefficient  algorithms  may  exhibit  a  speedup  less  than  one.  Furthermore,  we  again 
see  a  consistent  pattern  in  which  the  dynamic  scheduling  algorithms  dramatically  outperform  the  static 
scheduling  algorithms.  The  inefficiency  in  the  static  scheduling  algorithms  is  so  dramatic  that  the  parallel 
valiants  seldom  achieve  more  than  a  factor  of  2  speedup  using  16  processors. 

We  measured  algorithm  work  in  Figure  3.14c,  in  terms  of  the  number  of  message  calculations  before 
convergence.  The  total  work,  which  is  a  measure  of  algorithm  efficiency,  should  be  as  small  as  possible 
and  not  depend  on  the  number  of  processors.  We  find  that  the  Splash  algorithm  generally  does  the  least 
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Algorithm  3.13:  The  Distributed  BP  Algorithm 


l  B  4—  Parti tion(G, p)  / /  Partition  the  graph  over  processors 

DistributeGraph(G\  B) 

forall  b  e  B  do  in  parallel 
I  repeat 


2 


/ /  Perform  BP  Updates  according  to  local  schedule 

R  u  n  B  PA  I  go  rit  h  n  l  O  n  ( b) 

RecvExternalMsgs ( )  //  Receive  and  integrate  messages 
SendExternalMsgs ( )  //  Transmit  boundary  messages 
until  not  Converged()  / /  Distributed  convergence  test 


work  and  that  the  number  of  processors  has  minimal  impact  on  the  total  work  done. 

Finally,  we  assessed  computation  efficiency  (Figure  3.14d)  by  computing  the  number  of  message  calcula¬ 
tions  per  processor-second.  The  computational  efficiency  is  a  measure  of  the  raw  throughput  of  message 
calculations  during  inference.  Ideally,  the  efficiency  should  remain  as  high  as  possible.  The  computational 
efficiency,  captures  the  cost  of  building  spanning  trees  in  the  Splash  algorithm  or  frequently  updating 
residuals  in  the  residual  algorithm.  The  computational  efficiency  also  captures  concurrency  costs  incurred 
at  locks  and  barriers.  In  all  cases  we  see  that  the  Splash  algorithm  is  considerably  more  computationally 
efficient.  While  it  is  tempting  to  conclude  that  the  implementation  of  the  Splash  algorithm  is  more  opti¬ 
mized,  all  algorithms  used  the  same  message  computation,  scheduling,  and  support  code  and  only  differ 
in  the  core  algorithmic  structure.  Hence  it  is  surprising  that  the  Splash  algorithm,  even  with  the  extra 
overhead  associated  with  generating  the  spanning  tree  in  each  Splash  operations.  However,  by  reducing  the 
frequency  of  queue  operations  and  the  resulting  lock  contention,  by  increasing  cache  efficiency  through 
message  reuse  in  the  forward  and  backward  pass,  and  by  favoring  lower  work  high  residual  vertices  in  each 
spanning  tree,  the  Splash  algorithm  is  able  to  update  more  messages  per  processor-second. 


3.8  Belief  Propagation  Algorithms  on  Distributed  Architectures 


In  contrast  to  the  shared  memory  setting  where  each  processor  has  symmetric  fast  access  to  the  entire 
program  state,  in  the  distributed  memory  setting  each  process  can  only  directly  access  its  local  memory 
and  must  pass  messages  to  communicate  with  other  processors.  Therefore  efficient  distributed  parallel 
algorithms  arc  forced  to  explicitly  distribute  the  program  state. 

While  the  distributed  memory  setting  can  be  more  challenging,  it  offers  the  potential  for  linear  scaling 
in  memory  capacity  and  memory  bandwidth.  To  fully  realize  the  linear  scaling  in  memory  capacity  and 
bandwidth  we  must  partition  the  program  state  in  a  way  that  places  only  a  fraction  1/p  of  the  global 
program  state  on  each  processor.  Evenly  distributing  computation  and  storage  while  minimizes  across 
machine  coordination  is  the  central  challenge  in  the  design  of  scalable  distributed  parallel  algorithms. 
While  the  distributed  setting  often  considers  systems  with  network  and  processor  failure,  in  this  section  we 
assume  that  the  all  resources  remain  available  throughout  execution  and  that  all  messages  eventually  reach 
their  destination. 

In  Alg.  3.13  we  present  a  generic  distributed  BP  algorithm  composed  of  partitioning  phase  (Line  1)  after 
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which  each  processor  repeatedly  executes  a  BP  update  (Line  2)  using  a  local  schedule  followed  by  inter¬ 
processor  communication  and  a  distributed  convergence  test  (Line  3).  To  achieve  balance  computation  and 
communication  we  employ  weighted  graph  partitioning  and  over-partitioning  which  we  will  now  describe. 
As  it  is  often  impractical  to  maintain  an  exact  schedules  across  distributed  memory,  relaxed  versions  are 
discussed  in  Section  3.8.2.  Finally,  we  address  the  challenges  of  distributed  convergence  assessment  in 
Section  3.8.3. 

3.8.1  Partitioning  the  Factor  Graph  and  Messages 

To  distribute  the  state  of  the  program  among  p  processors,  we  begin  by  partitioning  the  graphical  model 
and  BP  messages  (parameters).  To  maximize  throughput  and  hide  network  latency,  we  must  minimize 
inter-processor  communication  and  ensure  that  the  data  needed  for  message  computations  are  locally 
available.  We  define  a  partitioning  of  the  factor  graph  over  p  processors  as  a  set  B  =  {B i, ...,  Bp }  of 
disjoint  sets  of  vertices  B^  C  V  such  that  Jp= ,  Bp-  =  V.  Given  a  partitioning  B  we  assign  all  the  factor 
data  associated  with  /,  £  B^  to  the  kth  processor.  Similarly,  for  all  (both  factor  and  variable)  vertices 
i  £  Bj.  we  store  the  associated  belief  and  all  inbound  messages  on  the  processor  k.  Each  vertex  update  is 
therefore  a  local  procedure.  For  instance,  if  vertex  i  is  updated,  the  processor  owning  vertex  i  can  read 
factors  and  all  incoming  messages  without  communication.  To  maintain  the  locality  invariant,  updated 
messages  destined  to  remote  vertices  are  immediately  transmitted  across  the  network  to  the  processors 
owning  the  destination  vertices. 

To  minimize  communication  and  ensure  balanced  storage  and  computation,  we  frame  the  minimum 
communication  load  balancing  objective  in  terms  of  a  classic  balanced  graph  partitioning: 

mjn  Y  Y  (ui  +  Uj)\Xi\ 

BeB  (ieB,j(£B)eE 

subj.  to:  MB  £  B  Y^  UiWi  <  —  ^  Uvwv 

i£B  P  vGv 

where  Ui  is  the  number  of  times  SendMessages  is  invoked  on  vertex  i  and  Wi  is  the  vertex  work  defined 
in  Eq.  (3.20),  and  7  >  I  is  the  balance  coefficient.  The  objective  in  Eq.  (3.21)  minimizes  communication 
while  the  constraint  in  Eq.  (3.22)  ensures  work  balance  for  small  7  and  is  commonly  referred  to  as  the  k- way 
balanced  cut  objective  which  is  unfortunately  iVP-Hard  in  general.  However,  there  are  several  popular 
graph  partitioning  libraries  such  as  METIS  [Karypis  and  Kumar,  1998]  and  Chaco  [Hendrickson  and 
Leland,  1994],  which  quickly  produce  reasonable  approximations.  Here  we  use  the  collection  of  multilevel 
graph  partitioning  algorithms  in  the  METIS  graph  partitioning  library.  These  algorithms,  iteratively  coarsen 
the  underlying  graph,  apply  high  quality  partitioning  techniques  to  the  small  coarsened  graph,  and  then 
iteratively  refine  the  coarsened  graph  to  obtain  a  high  quality  partitioning  of  the  original  graph.  While 
there  arc  no  theoretical  guarantees,  these  algorithms  have  been  shown  to  perform  well  in  practice  and  arc 
commonly  used  for  load  balancing  in  parallel  computing. 

In  the  case  of  static  schedules,  every  vertex  is  updated  the  same  number  of  times  (Ui  =  Uj  :  Vi,  j.) 
and  therefore  U  can  be  eliminated  from  both  the  objective  and  the  constraints.  Unfortunately,  dynamic 
algorithms  (Wildfire,  Residual,  and  Splash  BP)  the  update  counts  Ui  for  each  vertex  are  neither  fixed 
or  known.  Furthermore,  the  update  counts  are  difficult  to  estimate  since  they  depend  on  the  graph 
structure,  factors,  and  progress  towards  convergence.  Consequently,  for  dynamic  algorithms,  we  advocate  a 
randomized  load  balancing  technique  based  on  over-partitioning,  which  does  not  require  knowledge  of  the 
number  of  updates  Ur. 


(3.21) 

(3.22) 
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(a)  Original  Image  (b)  Noisy  Image  (c)  Factor  Graph 


(d)  Update  Counts 


(e)  Uninformed  Part.  (f)  Informed  Part. 


(g)  Update  Distribution 


(h)  Update  Prediction 


Figure  3.15:  We  use  the  synthetic  denoising  task  (also  used  in  Figure  3.13)  to  illustrate  the  difficult  in 
estimating  the  update  patterns  of  Splash  belief  propagation,  (a)  The  original  synthetic  sunset  image  which 
was  specifically  created  to  have  an  irregular  update  pattern,  (b)  The  synthetic  image  with  Gaussian  noise 
added,  (c)  Factor  graph  model  for  the  true  values  for  the  underlying  pixels.  The  latent  random  variable 
for  each  pixel  is  represented  by  the  circles  and  the  observed  values  are  encoded  in  the  unary  factors 
drawn  as  shaded  rectangles,  (d)  The  update  frequencies  of  each  variable  plotted  in  log  intensity  scale  with 
brighter  regions  updated  more  frequently,  (e)  Uniformed  U,  =  1  balanced  partitioning,  (f)  The  informed 
partitioning  using  the  true  update  frequencies  after  running  Splash.  Notice  that  the  informed  partitioning 
assigns  fewer  vertices  per  processors  on  the  top  half  of  the  image  to  compensate  for  the  greater  update 
frequency,  (g)  The  distribution  of  vertex  update  counts  for  an  entire  execution,  (h)  Update  counts  from 
first  the  half  of  the  execution  plotted  against  update  counts  from  the  second  half  of  the  execution.  This  is 
consistent  with  phases  of  execution  described  in  Figure  3.13c. 


To  illustrate  the  difficulty  involved  in  estimating  the  update  counts  for  each  vertex,  we  again  use  the 
synthetic  denoising  task.  The  input,  shown  in  Figure  3.15a,  is  a  grayscale  image  with  independent  Gaussian 
noise  N  (0,  <r2)  added  to  each  pixel.  The  factor  graph  (Figure  3.15c)  corresponds  to  the  pairwise  grid 
Markov  Random  Field  constructed  by  introducing  a  latent  random  variable  for  each  pixel  and  connecting 
neighboring  variables  by  factors  that  encode  a  similarity  preference.  We  also  introduce  single  variable 
factors  which  represent  the  noisy  pixel  evidence.  The  synthetic  image  was  constructed  to  have  a  nonuniform 
update  pattern  (Figure  3.15d)  by  making  the  top  half  more  irregular  than  the  bottom  half.  The  distribution 
of  vertex  update  frequencies  (Figure  3.15g)  for  the  denoising  task  is  nonuniform  with  a  few  vertices  being 
updated  orders  of  magnitude  more  frequently  than  the  rest.  The  update  patterns  is  temporally  inconsistent 
frustrating  attempts  to  estimate  future  update  counts  using  past  behavior  (Figure  3.15h). 
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Uninformed  Partitioning 


Given  the  inability  to  predict  the  update  counts  Ui  one  might  conclude,  as  we  did,  that  sophisticated 
dynamic  graph  partitioning  is  necessary  to  effectively  balance  computation  and  minimize  communication. 
Surprisingly,  we  find  that  an  uninformed  cut  obtained  by  setting  the  number  of  updates  to  a  constant  (i.e., 
Ui  =  1)  yields  a  partitioning  with  comparable  communication  cost  and  work  balance  as  those  obtained 
when  using  the  true  update  counts.  We  designate  B  as  the  partitioning  that  optimizes  the  objectives  in 
Eq.  (3.21)  and  Eq.  (3.22)  where  we  have  assumed  U{  =  1.  In  Table  3.2  we  construct  uninformed  cuts 
(p  =  120)  for  several  factor  graphs  and  report  the  communication  cost  and  balance  defined  as: 

J2bgI3  Th(u&B,v£B)&E  wuv 

J2bgB*  Yj(u€B,vi£B)€E  Wuv 

— - max  y  wv 

z2vev  wv  b&b  “ 

relative  to  the  ideal  cut  B*  obtained  using  the  true  update  counts  Ur.  We  find  that  uninformed  cuts  have 
lower  communication  costs  at  the  expense  of  increased  imbalance.  This  discrepancy  arises  from  the  need 
to  satisfy  the  balance  requirement  with  respect  to  the  true  Ut  at  the  expense  of  a  higher  communication 
cost. 


Rel.  Com.  Cost  = 
Rel.  Work  Balance  = 


Graph 

Rel.  Com.  Cost 

Rel.  Work  Balance 

denoise 

0.980 

3.44 

uw-systems 

0.877 

1.837 

uw-languages 

1.114 

2.213 

cora- 1 

1.039 

1.801 

Table  3.2:  Comparison  of  uninformed  and  informed  partitionings  with  respect  to  communication  cut  cost 
and  work  balance.  While  the  communication  costs  of  the  uniformed  partitionings  arc  comparable  to  those 
of  the  informed  partitionings,  the  work  imbalance  of  the  uninformed  cut  is  typically  greater. 


Over-Partitioning 

If  the  graph  is  partitioned  assuming  constant  update  counts,  there  could  be  work  imbalance  if  BP  is  executed 
using  a  dynamic  schedule.  For  instance,  a  frequently  updated  sub-graph  could  be  placed  within  a  single 
partition  as  shown  in  Figure  3.16a.  To  decrease  the  chance  of  such  an  event,  we  can  over-partition  the 
graph,  as  shown  in  Figure  3.16b,  into  k  xp  balanced  partitions  and  then  randomly  redistribute  the  partitions 
to  the  original  p  processors. 

By  partitioning  the  graph  more  finely  and  randomly  assigning  regions  to  different  processor,  we  more  evenly 
distribute  nonuniform  update  patterns  improving  the  overall  work  balance.  Flowever,  over-partitioning  also 
increases  the  number  of  edges  crossing  the  cut  and  therefore  the  communication  cost.  We  demonstrate 
this  effect  on  the  synthetic  denoising  problem  in  Figure  3.17a  and  Figure  3.17b.  We  also  observe  that  in 
the  limit  of  k  =  n/p,  the  over  partitioning  strategy  corresponds  exactly  to  randomly  distributing  vertices 
over  the  processors.  This  is  intuitively,  the  most  balanced  partitioning  strategy,  but  also  incurs  the  greatest 
amount  of  communication  overhead. 
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(a)  Denoise  Uniformed  Cut 


(b )  Denoise  Over-Partitioning 


Figure  3.16:  Over-partitioning  can  help  improve  work  balance  by  more  uniformly  distributing  the  graph 
over  the  various  processors,  (a)  A  two  processor  uninformed  partitioning  of  the  denoising  factor  graph  can 
lead  to  one  processor  (CPU1)  being  assigned  most  of  the  work,  (b)  Over-partitioning  by  a  factor  of  6  can 
improve  the  overall  work  balance  by  assigning  regions  from  the  top  and  bottom  of  the  denoising  image  to 
both  processors. 


Choosing  the  optimal  over-partitioning  factor  k  is  challenging  and  depends  heavily  on  hardware,  graph 
structure,  and  even  factors.  In  situations  where  the  algorithm  may  be  run  repeatedly,  standard  search 
techniques  may  be  used.  We  find  that  in  practice  a  small  factor,  e.g.,  k  =  5  is  typically  sufficient. 

The  over-partitioning  technique  may  be  justified  theoretically.  In  particular,  when  using  a  recursive  bisection 
style  partitioning  algorithm  where  the  true  work  split  at  each  step  is  an  unknown  random  variable,  there 
exists  a  theoretical  bound  on  the  ideal  size  of  k.  If  at  each  split  the  work  is  divided  into  two  parts  of 
proportion  X  and  1  —  X  where  E  [A']  =  ^  and  Var  [A]  =  a2  for  a  <  |,  Sanders  [1994]  shows  that  we 
can  obtain  work  balance  with  high  probability  if  we  select  k  to  be  at  least: 

A  similar,  but  looser  bound  was  obtained  by  Sanders  [1996]  even  if  work  balance  is  adversarial. 


Incremental  Repartitioning 

One  might  consider  occasionally  repartitioning  the  factor  graph  to  improve  balance.  For  example,  we  could 
divide  the  execution  into  T  phases  and  then  repartition  the  graph  at  the  beginning  of  each  phase  based  on 
the  update  patterns  from  the  previous  phase.  To  assess  the  potential  performance  gains  from  incremental 
repartitioning  we  conducted  a  retrospective  analysis.  We  executed  the  sequential  Splash  algorithm  to 
convergence  on  the  denoising  model  and  then  divided  the  complete  execution  into  T  phases  for  varying 
values  of  T.  We  then  partitioned  each  phase  using  the  true  update  counts  for  that  phase  and  estimate  the 
work  imbalance  and  number  of  transmitted  messages.  While  the  true  update  counts  would  not  be  known  in 
practice,  they  provide  an  upper  bound  on  the  performance  gains  of  the  best  possible  predictor. 

In  Figure  3.18  we  plot  the  performance  of  both  optimal  phased  partitioning  and  phased  partitioning  in 
which  future  update  counts  are  predicted  from  previous  update  counts.  In  both  cases  we  consider  phased 
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(a)  Denoise  Work  Balance 


(b)  Denoise  Rel.  Com.  Cost 


Figure  3.17:  The  effect  of  over-partitioning  on  the  work  balance  and  communication  cost.  For  all  points  30 
trials  with  different  random  assignments  arc  used  and  95%  confidence  intervals  are  plotted,  (a)  The  ratio 
of  the  size  of  the  partition  containing  the  most  work,  relative  to  the  ideal  size  (smaller  is  better),  (b)  The 
communication  cost  relative  to  the  informed  partitioning  (smaller  is  better). 


partitioning  for  T  G  {1, . . . ,  10}  phases  on  a  simulated  16  processor  system.  In  Figure  3.18a  we  see 
that  by  repartitioning  more  often  we  actually  slightly  increase  the  average  imbalance  over  the  epochs. 
We  attribute  the  slight  increased  imbalance  to  the  increasingly  irregular  local  update  counts  relative  to 
the  connectivity  structure  (the  regular  grid).  However,  as  seen  in  Figure  3.18b  the  number  of  messages 
transmitted  over  the  network  (i.e.,  the  communication  cost)  relative  to  the  optimal  single  phase  partitioning 
actually  decreases  by  roughly  35%.  Alternatively,  using  the  update  counts  from  the  previous  phase  to 
partition  the  next  phase  confers  a  substantial  decrease  in  overall  balance  (Figure  3.18c)  and  a  dramatic 
increase  in  the  communication  costs  (Figure  3.18d).  We  therefore  conclude  that  only  in  situations  where 
update  counts  can  be  accurately  predicted  and  network  communication  is  the  limiting  performance  can 
incremental  repartitioning  lead  to  improved  performance. 


3.8.2  Distributing  Scheduling 


The  SplashBP  and  ResidualBP  algorithms  rely  on  a  shared  global  priority  queue.  Because  a  centralized 
ordering  is  too  costly  in  the  distributed  setting,  each  processor  maintains  its  own  local  priority  queue  over 
its  vertices.  At  any  point  in  the  distributed  execution  one  of  the  processor  is  always  applying  the  Splash 
operation  to  the  globally  highest  residual  vertex.  Unfortunately,  the  remaining  p—  1  highest  vertices  are  not 
guaranteed  to  be  at  the  top  of  the  remaining  queues  and  so  we  do  not  recover  the  original  shared  memory 
scheduling.  However,  any  processor  with  vertices  that  have  not  yet  converged,  must  eventually  update 
those  vertices  and  therefore  can  always  make  progress  by  updating  the  vertex  at  the  top  of  its  local  queue. 
In  Section  3.8.4  we  show  that  the  collection  of  local  queues  is  sufficient  to  retain  the  original  optimality 
properties  of  the  Splash  algorithm. 
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(a)  Optimal  Balance  (b)  Optimal  Comm. 


2  4  6  8  10 

Number  of  Epochs 


(d)  Naive  Comm. 


Figure  3.18:  Here  we  plot  the  balance  and  communication  costs  of  incremental  repartitioning  against 
the  number  of  epochs  (repartitioning  phases)  in  a  single  execution.  The  balance  is  measured  as  the  work 
ratio  between  the  smallest  and  largest  work  block.  We  present  the  messages  transmitted  measured  relative 
to  the  optimal  performance  for  a  single  phase  rather  than  raw  message  counts.  (a,b)  The  balance  and 
communication  costs  for  optimal  phased  partitioning  where  the  update  counts  in  each  phase  arc  known  in 
advance.  (c,d)  The  balance  and  communication  costs  for  naive  phased  partitioning  where  the  update  counts 
in  each  phase  are  assumed  to  be  the  update  counts  observed  in  the  previous  phase. 


3.8.3  Distributed  Termination 


In  the  absence  of  a  common  synchronized  state  it  is  difficult  to  assess  whether  a  convergence  condition  has 
been  globally  achieved.  The  task  of  assessing  convergence  is  complicated  by  the  fact  that  when  a  node 
locally  converges,  it  could  be  woken  up  at  any  time  by  an  incoming  message  which  may  causes  its  top 
residual  to  exceed  the  termination  threshold. 

Fortunately,  the  distributed  termination  problem  [Matocha  and  Camp,  1998,  Mattem,  1987]  is  well  studied 
with  several  solutions.  We  implement  a  variation  of  the  algorithm  described  in  Misra  [1983].  The  algorithm 
(Alg.  3.14)  pases  a  token  comprising  of  2  integer  values,  NumSent  and  NumReceived,  in  a  fixed  loop 
over  the  machines.  The  integer  values  represent  the  total  number  of  network  messages  sent  and  received 
throughout  the  execution  of  the  program  across  all  machines.  When  the  node  holding  onto  the  token 
converges,  the  node  will  add  the  number  of  network  messages  it  has  sent  and  received  since  the  last  time  it 
has  seen  the  token,  to  NumSent  and  NumReceived  respectively.  The  node  will  then  forward  the  token 
to  the  next  node  in  the  ring.  Global  termination  is  achieved  when  the  token  completes  a  loop  around  the 
ring  without  changing,  and  NumSent  equals  NumReceived. 
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Algorithm  3.14:  Distributed  Termination  Algorithm 
Output:  Converged 

Data:  Global  Variables:  LastToken,  NetMessagesSent,  NetMessagesReceived 
//  Infinite  Loop 
while  True  do 

if  I  have  the  Token  then 

//  Check  for  the  global  termination 
if  Token  =  =  LastToken  AND  Token.NumSent  =  =  Token.NumReceived  then 
|  return  Converged  =  True 

else 

//  Update  the  Token  with  this  node' s  contributions 
//  Reset  the  network  message  counters 

Token.NumSent  -t—  Token.NumSent  +  NetMessagesSent 
Token.NumReceived  Token.NumReceived  +  NetMessagesReceived 
NetMessagesSent «—  0 
NetMessagesReceived  -f-  0 

LastToken -t— Token  / /  Remember  the  last  Token  we  saw 
Transmit(Token,  NextNode)  / /  Send  the  updated  token  to  the  next 
node  in  the  ring 


Sleep  and  Wait  For  Any  Incoming  Network  Message. 

if  Network  Message  is  not  a  Token  then 

//  This  is  not  a  message.  Wake  up  and  process  it 
return  Converged  =  False 


3.8.4  The  Distributed  Splash  Algorithm 

We  now  present  the  distributed  Splash  algorithm  (Alg.  3.15)  which  can  be  divided  into  two  phases,  setup 
and  inference.  In  the  setup  phase,  in  Line  1  we  over-segment  the  input  factor  graph  into  kp  pieces  using  the 
METIS  partitioning  algorithm.  Note  that  this  could  be  accomplished  in  parallel  using  ParMETIS,  however 
our  implementation  uses  the  sequential  version  for  simplicity.  Then  in  Line  2  we  randomly  assign  k  pieces 
to  each  of  the  p  processors.  In  parallel  each  processor  collects  its  factors  and  variables  (Line  3).  On  Line  4 
the  priorities  of  each  variable  and  factor  are  set  to  infinity  to  ensure  that  every  vertex  is  updated  at  least 
once. 

On  Line  5  we  evaluate  the  top  residual  with  respect  to  the  (3  convergence  criterion  and  check  for  termination 
in  the  token  ring.  On  Line  6,  the  DynamicSplash  operation  is  applied  to  v.  In  the  distributed  setting 
Splash  construction  procedure  does  not  cross  partitions.  Therefore  each  Splash  is  confined  to  the  vertices 
on  that  processor.  After  completing  the  Splash  all  external  messages  from  other  processors  are  incorporated 
(Line  7).  Any  beliefs  that  changed  during  the  Splash  or  after  receiving  external  messages  are  promoted  in 
the  priority  queue  on  Line  9.  On  Line  10,  the  external  messages  arc  transmitted  across  the  network.  The 
process  repeats  until  termination  at  which  point  all  beliefs  arc  sent  to  the  originating  processor. 

Empirically,  we  find  that  accumulating  external  messages  and  transmitting  only  once  every  5-10  loops 
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Algorithm  3.15:  The  Distributed  Splash  Algorithm 


//  The  graph  partitioning  phase  ' 

//  Construct  factor  k  over-partitioning  for  p  processors 

1  £>temp  <—  OverSegment  (G,p,  k)  ; 

//  Randomly  assign  over-partitioning  to  the  original  processors 

2  B  •(—  RandomAssign  (Btemp,p) ; 

//  The  distributed  inference  phase  ======================> 

forall  b  e  13  do  in  parallel 

//  Collect  the  variables  and  factors  associated  with  this 
partition 

3  Collect  (Jt,, X(,)  ; 

//  Initialize  the  local  priority  queue 

4  Initialize  (Q); 

//  Loop  until  TokenRing  signifies  convergence 

5  while  TokenRing  (Q,  (3)  do 

v  <-  Top  (Q)  ; 

6  DynamicSplash  (v,  Wmax,  /3) ; 

7  RecvExternalMsgs ( ) ; 

//  Update  priorities  affected  by  the  Splash  and  newly 
received  messages 

8  foreach  u  £  local  changed  vertices  do 

9  Promote  (Q,  IIA&^llj) ; 

to  SendExternalMsgs  ( ) ; 


tends  to  increase  performance  by  substantially  decreasing  network  overhead.  Accumulating  messages 
may  however  adversely  affect  convergence  on  some  graphical  models.  To  ensure  convergence  in  our 
experiments,  we  transmit  on  every  iteration  of  the  loop. 

We  now  show  that  Splash  retains  the  optimality  in  the  distributed  setting. 

Theorem  3.8.1  (Splash  Chain  Optimality).  Given  a  chain  graph  with  n  =  n  vertices  and  p  <  n  processes, 
the  distributed  Splash  algorithm  with  no  over-segmentation,  using  a  graph  partitioning  algorithm  which  re¬ 
turns  connected  partitions,  and  with  work  Splash  size  at  least  2  YIvgV  wv/p  w;7/  obtain  a  re- approximation 

in  expected  running  time  O  +  reY 


Proof  of  Theorem  3.8.1.  We  assume  that  the  chain  graph  is  optimally  sliced  into  p  connected  pieces  of 
n/p  vertices  each.  Since  every  vertex  has  at  most  2  neighbors,  the  partition  has  at  most  2 n/p  work.  A 
Splash  anywhere  within  each  partition  will  therefore  cover  the  entire  partition,  performing  the  complete 
forward-backward  update  schedule. 

Because  we  send  and  receive  all  external  messages  after  every  splash,  after  \  fjj^\  iterations,  every  vertex 
will  have  received  messages  from  vertices  a  distance  of  at  least  t€  away.  The  runtime  will  therefore  be: 


2  n 
—  x 
P 
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Since  each  processor  only  send  2  external  messages  per  iteration  (one  from  each  end  of  the  partition), 
communication  therefore  only  adds  a  constant  to  the  total  runtime.  □ 


3.8.5  Algorithm  Comparison  in  the  Distributed  Setting 

Distributed  experiments  were  conducted  on  cluster  composed  of  13  blades  (nodes)  each  with  2  Intel  Xeon 
E5345  2.33GHz  Quad  core  processors.  Each  processor  is  connected  to  the  memory  on  each  node  by  a 
single  memory  controller.  The  nodes  are  connected  via  a  fast  Gigabit  ethernet  switch.  We  invoked  the 
weighted  kmetis  partitioning  routine  from  the  METIS  software  library  for  graph  partitioning  and  partitions 
were  computed  in  under  10  seconds. 

To  assess  the  performance  of  the  BP  algorithms  in  the  distributed  setting  we  focus  our  attention  on  larger 
models.  To  adequately  challenge  the  distributed  algorithms  we  consider  the  Elidan  protein  networks 
(with  a  stricter  termination  bound  (3  =  10“ 10)  and  the  larger  UW-CSE  MLNs  with  the  termination  bound 
/ 3  =  10-5.  All  the  baseline  algorithms  converged  on  the  Elidan  protein  networks  except  synchronous 
running  on  one  processor  which  ran  beyond  the  10,000  second  cutoff.  Consistent  with  the  multicore  results, 
only  the  Splash  algorithm  converges  on  the  larger  UW-CSE  MLNs. 

Since  all  the  baseline  algorithms  converged  on  the  smaller  Elidan  networks  we  use  these  networks  for 
algorithm  comparison.  Because  the  single  processor  runtime  on  these  networks  is  only  a  few  minutes  we 
analyze  performance  from  1  to  40  processors  in  increments  of  5  rather  than  using  the  entire  104  processor 
system.  As  discussed  in  the  experimental  setup,  we  consider  over-partition  factors  of  2,  5,  and  10  for  each 
algorithm  and  present  the  results  using  the  best  over-partition  factors  for  each  of  the  respective  algorithms 
(which  is  typically  5).  In  Figure  3.19a  and  Figure  3.19b  we  present  the  standard  runtime  and  speedup  curves 
which  show  that  the  Splash  algorithm  achieves  the  best  runtime  and  speedup  performance  with  a  maximum 
speedup  of  roughly  23  at  40  processors.  While  the  speedup  does  not  scale  linearly  we  achieve  a  runtime  of 
6.4  seconds  at  40  processors  making  it  difficult  to  justify  extending  to  an  additional  80  processors.  We  also 
plot  the  amount  of  work  Figure  3.19c  as  a  function  of  the  number  of  processors  and  find  that  despite  the 
message  delays  and  distributed  scheduling,  the  algorithmic  efficiency  remains  relatively  constant  with  the 
Splash  and  Wildfire  algorithm  performing  optimally. 

In  Figure  3.1 9d  we  plot  the  total  amount  of  network  traffic  measured  in  bytes  as  a  function  of  the  number  of 
processors  and  we  find  that  the  Splash  algorithm  performs  the  minimal  amount  of  network  communication. 
Furthermore  the  amount  of  network  communication  scales  linearly  with  the  number  of  processors.  We  also 
find  in  Figure  3.19e  that  the  Splash  algorithm  is  able  to  more  fully  utilize  the  cluster  by  executing  more 
message  calculations  per  processor-second  and  doing  so  consistently  as  the  number  of  processor  scales. 
Finally,  we  compare  the  algorithms  in  terms  of  their  communication  efficiency  measured  in  total  bytes  sent 
per  processor-second.  The  total  bytes  sent  per  processor-second,  plotted  in  Figure  3.19f  is  a  measure  of 
the  network  bandwidth  used  per  machine.  Again  we  find  that  the  Splash  requires  the  minimal  network 
bandwidth  and  that  the  network  bandwidth  grows  sub-linearly  with  the  number  of  processors.  At  peak  load 
we  use  less  than  a  megabyte  per  second  per  machine. 

The  larger  UW-CSE  MLNs  provide  the  best  demonstration  of  the  effectiveness  of  the  Splash  algorithm. 
None  of  the  other  inference  algorithms  converge  on  these  models,  and  the  models  arc  sufficiently  large  to 
justify  the  use  of  a  cluster.  In  Figure  3.20  and  Figure  3.21,  we  provide  plots  for  the  UW-Systems  and  the 
UW-AI  MLNS  respectively. 

The  UW-Systems  model  is  larger,  and  we  could  demonstrate  a  linear  to  super-linear  speedup  up  to  about  65 
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Figure  3.19:  Distributed  Results  for  the  Elidanl  Protein  Network 


processors.  At  65  processors,  the  runtime  is  about  23.6  seconds  with  a  partition  factor  of  10.  The  speedup 
curves  start  to  flatten  out  beyond  that.  The  plot  in  Figure  3.20b  justify  the  claim  that  over-partitioning  can 
increase  performance  through  improved  load-balancing.  Over-partitioning  however,  does  lead  to  increased 
network  communication  as  seen  in  Figure  3.20d  and  Figure  3.20f.  At  65  processors,  over  100  MB  of 
network  messages  were  communicated  in  23  seconds. 

The  UW-AI  model  in  Figure  3.21  has  similar  results.  However,  as  the  model  is  smaller,  we  notice  that  the 
speedup  curves  start  to  flatten  out  earlier,  demonstrating  linear  speedup  only  up  to  about  39  processors. 
Once  again,  at  65  processors,  the  runtime  is  only  22.7  seconds  with  a  partition  factor  of  10. 


3.9  Additional  Related  Work 


While  we  focus  on  approximate  inference  methods,  much  of  the  existing  work  on  parallel  algorithms 
for  graphical  models  focuses  on  exact  inference  which  is  extremely  computationally  intensive.  Most 
exact  inference  algorithms  operate  on  the  junction  tree  which  is  an  acyclic  representation  of  the  factorized 
distribution.  One  of  the  earliest  parallel  exact  inference  algorithm  was  introduced  by  Kozlov  and  Singh 
[1994]  who  implemented  the  Lauritzen-Spiegelhalter  (see  Koller  and  Friedman  [2009])  algorithm  on  a  32 
core  Stanford  DASH  processor.  Later  Pennock  [1998]  provided  a  method  to  both  construct  the  junction 
tree  and  apply  Lauritzen-Spiegelhalter  with  log  depth  inference  using  a  form  of  parallel  tree  contraction 
[Miller  and  Reif,  1985].  While  the  work  by  Pennock  [1998]  provides  an  optimal  algorithm  it  requires 
substantial  fine  grain  coordination.  Later  Xia  and  Prasanna  [2008,  2010]  and  Xia  et  al.  [2009]  implemented 
the  algorithm  for  various  platforms  and  demonstrated  limited  scaling  on  models  with  high  treewidth. 
Unfortunately,  the  complexity  of  inference  in  junction  trees  is  exponential  in  the  treewidth  which  grows 
quickly  in  loopy  models  with  many  variables.  Therefore,  for  most  real-world  large-scale  models  the 
treewidth  is  typically  prohibitively  large. 
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Figure  3.20:  Distributed  Parallel  Results  for  the  UW-Sy stems  MLN 
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Figure  3.21:  Distributed  Parallel  Results  for  the  UW-AI  MLN 
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In  the  sequential  setting  their  has  been  an  increased  emphasis  on  asynchronous  prioritized  scheduling  for 
belief  propagation.  Soni  et  al.  [2010]  examined  the  use  of  asynchronous  domain  specific  scheduling  to 
accelerate  convergence.  Prior  to  our  work  on  SplashBP,  Wainwright  et  al.  [2006]  proposed  an  alternative 
factorization  of  approximate  inference  algorithms  along  spanning  trees.  This  work  conceptually  corresponds 
to  scheduling  message  updates  along  spanning  trees.  After  our  work  on  SplashBP,  Tarlow  et  al.  [2011] 
proposed  a  prioritized  tree  based  inference  algorithm  for  max-product  linear  program  solver  which  closely 
resembles  SplashBP.  It  is  encouraging  to  see  that  the  GrAD  methodology  not  only  leads  to  efficient  parallel 
algorithms  but  is  equally  applicable  in  the  design  of  efficient  sequential  algorithms. 

The  re- approximation  can  be  related  to  work  in  approximate  message  passing  algorithms.  In  particular', 
work  by  Ihler  et  al.  [2005]  studied  the  effect  of  errors  in  the  convergence  of  belief  propagation.  They  related 
the  dynamic  range  of  factors,  a  measure  of  sensitivity  to  changes  in  message  values,  to  the  convergence 
and  accuracy  of  loopy  belief  propagation.  Consistent  with  our  re-approximation  they  found  that  messages 
errors  typically  decay  over  long  chains. 


3.10  Conclusion 

Approximate  parallel  inference  algorithms  are  important  to  a  wide  range  of  machine  learning  applications. 
In  this  chapter  we  reviewed  the  natural  parallelization  of  the  belief  propagation  algorithm  using  the 
synchronous  schedule  and  demonstrated  both  theoretically  and  empirically,  how  this  can  lead  to  a  highly 
inefficient  parallel  algorithm. 

Using  the  GrAD  design  principles  we  gradually  improved  upon  the  Synchronous  BP  algorithm  by  gradually 
introducing  asynchronous  dynamic  scheduling.  This  lead  to  the  sequence  of  algorithms,  Round  Robin, 
Wildfire,  and  Residual  BP,  each  improving  upon  the  previous.  Finally  by  identifying  the  serial  sub-problems, 
chains  and  more  generally  trees,  we  developed  the  Splash  BP  algorithm,  which  sequentially  moves  messages 
along  spanning  trees.  We  theoretically  demonstrated  that  the  SplashBP  algorithm  is  optimal  on  chains  but 
generalizes  to  arbitrary  cyclic  models. 

We  discussed  how  to  implement  each  parallel  belief  propagation  algorithm  in  the  shared  memory  setting 
using  basic  locking  primitives  and  parallel  priority  queues.  We  iden tided  load  balancing  and  communication 
contention  in  the  context  of  dynamic  scheduling  as  key  challenges  to  a  distributed  belief  propagation 
algorithms  and  presented  a  simple  over-partitioning  method  to  address  these  challenges.  Finally,  we 
experimentally  compared  each  parallel  belief  propagation  algorithm  demonstrating  the  substantial  gains  of 
applying  the  GrAD  design  methodology  in  both  the  shared  memory  and  distributed  memory  settings. 
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Chapter  4 


Probabilistic  Inference:  Gibbs  Sampling 


Gibbs  sampling  is  a  popular  MCMC  inference  procedure  used  widely  in  statistics  and  machine  learning. 
However,  on  many  models  the  Gibbs  sampler  can  be  slow  mixing  [Barbu  and  Zhu,  2005,  Kuss  and 
Rasmussen,  2005].  Consequently,  a  number  of  authors  [Asuncion  et  al.,  2008,  Doshi-Velez  et  al.,  2009, 
Newman  et  al.,  2007,  Yan  et  al.,  2009]  have  proposed  parallel  methods  to  accelerate  Gibbs  sampling. 
Unfortunately,  most  of  the  recent  methods  rely  on  synchronous  Gibbs  updates  that  are  not  ergodic,  and 
therefore  generate  chains  that  do  not  converge  to  the  targeted  stationary  distribution. 

By  applying  the  GrAD  methodology  we  introduce  three  ergodic  parallel  Gibbs  samplers.  The  first,  called 
the  Chromatic  sampler,  applies  a  classic  technique  relating  graph  coloring  to  parallel  job  scheduling,  to 
obtain  a  direct  parallelization  of  the  classic  sequential  scan  Gibbs  sampler.  We  show  that  the  Chromatic 
sampler  is  provably  ergodic  and  provide  strong  guarantees  on  the  parallel  reduction  in  mixing  time.  We  then 
introduce  the  Asynchronous  Sampler  which  relaxes  the  Chromatic  schedule  and  introduces  the  Markov 
Blanket  Locking  protocol  to  allow  greater  asynchrony  while  preserving  ergodicity. 

For  the  relatively  common  case  of  models  with  two-colorable  Markov  random  fields,  the  Chromatic  sampler 
provides  substantial  insight  into  the  behavior  of  the  non-ergodic  Synchronous  Gibbs  sampler.  We  show  that 
in  the  two-colorable  case,  the  Synchronous  Gibbs  sampler  is  equivalent  to  the  simultaneous  execution  of 
two  independent  Chromatic  samplers  and  provide  a  method  to  recover  the  corresponding  ergodic  chains. 
As  a  consequence,  we  are  able  to  derive  the  invariant  distribution  of  the  Synchronous  Gibbs  sampler  and 
show  that  it  is  ergodic  with  respect  to  functions  over  single  variable  marginals. 

The  Chromatic  sampler  achieves  a  linear  increase  in  the  rate  at  which  samples  are  generated  and  is  therefore 
ideal  for  models  where  the  variables  are  weakly  coupled.  However,  for  models  with  strongly  coupled 
variables,  the  chain  can  still  mix  prohibitively  slowly.  In  this  case,  it  is  often  necessary  to  jointly  sample 
large  subsets  of  related  random  variables  [  Barbu  and  Zhu,  2005,  Jensen  and  Kong,  1996]  in  what  is  known 
as  a  blocking  Gibbs  sampler. 

The  third  parallel  Gibbs  sampler,  the  Splash  sampler,  addresses  the  challenges  of  highly  correlated 
variables  by  incrementally  constructing  multiple  bounded  tree-width  blocks,  called  Splashes,  and  then 
jointly  sampling  each  Splash  using  parallel  junction-tree  inference  and  backward-sampling.  To  ensure  that 
multiple  simultaneous  Splashes  are  conditionally  independent  (and  hence  that  the  chain  is  ergodic),  we 
introduce  a  Markov  blanket  locking  protocol.  To  accelerate  burn-in  and  ensure  high  likelihood  states  are 
reached  quickly,  we  introduce  a  vanishing  adaptation  heuristic  for  the  initial  samples  of  the  chain,  which 
explicitly  builds  blocks  of  strongly  coupled  variables. 
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We  construct  optimized  implementation  of  the  Chromatic  and  Splash  parallel  samplers  and  compare 
performance  on  synthetic  and  real-world  sampling  problems  using  a  32  processor  multicore  system.  The 
Asynchronous  sampler  is  a  special  case  of  the  Splash  sampler  with  Splash  size  set  to  one.  We  find  that 
both  algorithms  achieve  strong  speedups  in  sample  generation,  and  the  adaptive  Splash  sampler  can  further 
accelerate  mixing  on  strongly  correlated  models.  Our  experiments  illustrate  that  the  two  sampling  strategies 
complement  each  other:  for  weakly  coupled  variables,  the  Chromatic  sampler  performs  best,  whereas  the 
Splash  sampler  is  needed  when  strong  dependencies  arc  present. 


4.1  The  Gibbs  Sampler 


The  Gibbs  sampler  was  introduced  by  Geman  and  Geman  [1984]  and  is  a  popular  Markov  Chain  Monte 
Carlo  (MCMC)  algorithm  used  to  simulate  samples  from  the  joint  distribution.  The  Gibbs  sampler  is 
constructed  by  iteratively  sampling  each  variable, 


Xi  ~  P  {Xi  |  X_y.  =  x,v- )  (X  /a (Xt ,  )  (4.1) 

A:ieA,Ae.F 

given  the  assignment  to  the  variables  in  its  Markov  blanket.  Geman  and  Geman  [1984]  showed  that  if  each 
variable  is  sampled  infinitely  often  and  under  reasonable  assumptions  on  the  conditional  distributions  (e.g., 
positive  support),  the  Gibbs  sampler  is  ergodic  (i.e.,  it  converges  to  the  true  distribution).  While  we  have 
considerable  latitude  in  the  update  schedule,  we  shall  see  in  subsequent  sections  that  certain  updates  must 
be  treated  with  care:  in  particular,  Geman  and  Geman  [1984]  were  incorrect  in  their  claim  that  parallel 
simultaneous  sampling  of  all  valuables  (the  Synchronous  update)  yields  an  ergodic  chain. 

For  large  models  with  complex  dependencies,  the  mixing  time  and  even  the  time  required  to  obtain  a 
high  likelihood  sample  can  be  substantial.  Therefore,  we  would  like  to  use  parallel  resources  to  increase 
the  speed  of  the  Gibbs  sampler.  The  simplest  method  to  construct  a  parallel  sampler  is  to  run  a  separate 
chain  on  each  processor.  Flowever,  running  multiple  parallel  chains  requires  large  amounts  of  memory 
and,  more  importantly,  is  not  guaranteed  to  accelerate  mixing  or  the  production  of  high-likelihood  samples. 
As  a  consequence,  we  focus  on  single  chain  parallel  acceleration,  where  we  apply  parallel  methods  to 
increase  the  speed  at  which  a  single  Markov  chain  is  advanced.  The  single  chain  setting  ensures  that  any 
parallel  speedup  directly  contributes  to  an  equivalent  reduction  in  the  mixing  time,  and  the  time  to  obtain  a 
high-likelihood  sample. 

Unfortunately,  recent  efforts  to  build  parallel  single-chain  Gibbs  samplers  have  struggled  to  retain  ergodicity. 
The  resulting  methods  have  relied  on  approximate  sampling  algorithms  [Asuncion  et  al.,  2008]  or  proposed 
generally  costly  extensions  to  recover  ergodicity  [Doshi- Velez  et  al.,  2009,  Ferrari  et  al.,  1993,  Newman 
et  al.,  2007].  Here  we  will  apply  the  GrAD  methodology  to  design  efficient  parallel  Gibbs  samplers  while 
ensuring  ergodicity. 


4.2  The  Synchronous  Gibbs  Sampler 


A  naive  single  chain  parallel  Gibbs  sampler  is  obtained  by  sampling  all  variables  simultaneously  on  separate 
processors.  Called  the  Synchronous  Gibbs  sampler,  this  highly  parallel  algorithm  (Alg.  4.1)  was  originally 
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Algorithm  4.1:  The  Synchronous  Gibbs  Sampler 


forall  Variables  X}  do  in  parallel 

|  Execute  Gibbs  Update:  X^+1^  ~  P  (Xj  | 


proposed  by  Geman  and  Geman  [1984].  Unfortunately  the  extreme  parallelism  of  the  Synchronous  Gibbs 
sampler  comes  at  a  cost.  As  others  (e.g.,  Newman  et  al.  [2007])  have  observed,  one  can  easily  construct 
cases  where  the  Synchronous  Gibbs  sampler  is  not  ergodic  and  therefore  does  not  converge  to  the  correct 
stationary  distribution. 


For  example,  suppose  we  draw  (  X\ 


v(t—  l) 


from  the  multivariate  normal  distribution: 


P  <*\ 


and  then  use  the  Synchronous  Gibbs  sampler  to  generate  (x[*\  x'J' 


easily  show  that: 


given  ( X{ 


'{t—  1)  v(t—  1) 


One  can 


where  the  covariance  term  is  not  preserved.  Note,  however  that  the  marginals  remain  consistent.  Alterna¬ 
tively,  in  the  discrete  case,  consider  the  two  binary  variable  discrete  model  P  {X\.  X2)  where 


P  ( Xi  =  xi,X2  =  x2)  oc 


e  x\  7^  x2 
1  —  e  x\  =  x2 


for  some  0  <  e  <  1.  The  transition  matrix  of  the  Synchronous  sampler  which  samples  from  both  variables 
simultaneously  is  therefore: 


(0-0) 

TT^T 

(l-e)e 


(0.1)  (1.0) 

(l-e)e  (l-0« 

(1  -  c)2 


(l-e)e 

(l-e)e 


(1,0)  (l-e)e  (1-e)2  c2  (l-e)e 
(1,1)||  62  |  (l-e)e  |  (l-e)e  |  (1-e)2 

which  has  the  incorrect  stationary  distribution  (first  eigenvector  is  1): 

P(X1,X2)<xl 

Again,  the  marginals  P  {X\  )  =  P  {X\ )  and  P  (X2)  =  P  (X2)  arc  coiTectly  estimated. 

Not  only  does  the  Synchronous  Gibbs  sampler  converge  to  the  wrong  distribution,  it  can  actually  decelerate 
burn-in,  the  time  it  takes  to  reach  high-likelihood  samples.  If  we  return  to  the  simple  two  variable  discrete 
model  and  start  with  an  assignment  where  X\  /  X>  then  the  probability  that  X\  =  X2  after  a  single 
iteration  is: 

P  (A"ew  =  Xfew  |  J“ld  /  X^ld)  =  2e(l  -  e)  <  2e 

Therefore  for  small  e  the  probability  of  transitioning  to  a  high  likelihood  state  vanishes  when  run  syn¬ 
chronously  but  occurs  almost  surely  when  run  asynchronously. 
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Algorithm  4.2:  The  Chromatic  Sampler 
Input:  fc-Colored  MRF 

for  For  each  of  the  k  colors  nt  :  i  £  {1 .... .  /;:  }  do 

forall  Variables  Xj  G  n,  in  the  ith  color  do  in  parallel 

|  Execute  Gibbs  Update:  X(-+l)  ~  P  [x5  \  ^^K<i,  x$.eK>. 

end 

end 


4.3  The  Chromatic  Sampler 


Just  as  with  the  Synchronous  BP  algorithm  in  Chapter  3,  we  can  improve  upon  the  Synchronous  Gibbs 
sampler  by  scheduling  computation  asynchronously.  In  particular  we  want  to  construct  a  parallel  execution 
of  Gibbs  updates  that  has  an  equivalent  sequential  execution  and  therefore  retains  ergodicity.  We  can 
achieve  this  serial  equivalence  by  applying  a  classic  technique  from  parallel  computing  and  introduce  the 
Chromatic  parallel  Gibbs  sampler. 

The  Chromatic  Gibbs  sampler  (Alg.  4.2)  begins  by  constructing  a  /e-coloring  of  the  MRF.  A  /.'-coloring 
of  the  MRF  assigns  one  of  k  colors  to  each  vertex  such  that  adjacent  vertices  have  different  colors.  The 
Chromatic  sampler  simultaneously  draws  new  values  for  all  variables  in  one  color  before  proceeding 
to  the  next  color.  The  fc-coloring  of  the  MRF  ensures  that  all  variables  within  a  color  are  conditionally 
independent  given  the  variables  in  the  remaining  colors  and  can  therefore  be  sampled  independently  and  in 
parallel. 


By  combining  a  classic  result  ([Bertsekas  and  Tsitsiklis,  1989,  Proposition  2.6])  from  parallel  computing 
with  the  original  Geman  and  Geman  [1984]  proof  of  ergodicity  for  the  sequential  Gibbs  sampler  one  can 
easily  show: 


Proposition  4.3.1  (Graph  Coloring  and  Parallel  Execution).  Given  p  processors  and  a  k-coloring  of 
an  n-variable  MRF,  the  parallel  Chromatic  sampler  is  ergoclic  and  generates  a  new  joint  sample  in  running 
time: 


O 


Proof  From  [Bertsekas  and  Tsitsiklis,  1989,  Proposition  2.6]  we  know  that  the  parallel  execution  of  the 
Chromatic  sampler  corresponds  exactly  to  the  execution  of  a  sequential  scan  Gibbs  sampler  for  some 
permutation  over  the  variables.  The  running  time  can  be  easily  derived: 


□ 

Therefore,  given  sufficient  parallel  resource  (p  <S  O  (n))  and  a  fc-coloring  of  the  MRF,  the  parallel  Chromatic 
sampler  has  running-time  O  (k),  which  for  many  modeling  tasks  is  constant  in  the  number  of  vertices.  It  is 
important  to  note  that  this  parallel  gain  directly  results  in  a  factor  of  p  reduction  in  the  mixing  time  as  well 
as  the  time  to  reach  a  high-likelihood  sample. 
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(a)  Synchronous  Chain  (b)  Two  Ergodic  Chains 

Figure  4. 1 :  (a)  Execution  of  a  two  colored  model  using  the  synchronous  Gibbs  sampler.  The  dotted  lines  represent  dependencies 
between  samples,  (b)  Two  ergodic  chains  obtained  by  executing  the  Synchronous  Gibbs  sampler.  Note  that  ergodic  sums  with 
respect  to  marginals  are  equivalent  to  those  obtained  using  the  Synchronous  sampler. 


Unfortunately,  constructing  the  minimal  coloring  of  a  general  MRF  is  NP-Complete.  However,  for  many 
common  models  the  optimal  coloring  can  be  quickly  derived.  For  example,  given  a  plate  diagram,  we 
can  typically  compute  an  optimal  coloring  of  the  plates,  which  can  then  be  applied  to  the  complete  model. 
When  an  optimal  coloring  cannot  be  trivially  derived,  we  find  simple  graph  coloring  heuristics  (see  Kubale 
[2004])  perform  well  in  practice. 


4.3.1  Properties  of  2-Colorable  Models 

Many  common  models  in  machine  learning  have  natural  two-colorings.  For  example,  Latent  Dirichlet 
Allocation,  the  Indian  Buffet  process,  the  Boltzmann  machine,  hidden  Markov  models,  and  the  grid  models 
commonly  used  in  computer  vision  all  have  two-colorings.  For  these  models,  the  Chromatic  sampler 
provides  additional  insight  into  properties  of  the  Synchronous  sampler.  The  following  theorem  relates  the 
Synchronous  Gibbs  sampler  to  the  Chromatic  sampler  in  the  two-colorable  setting  and  provides  a  method 
to  recover  two  ergodic  chains  from  a  single  Synchronous  Gibbs  chain: 

Theorem  4.3.2  (2-Color  Ergodic  Synchronous  Samples).  Let  (jW)™0  be  the  non-ergodic  Markov 
chain  constructed  by  the  Synchronous  Gibbs  sampler  ( Alg .  4.1)  then  using  only  (A'W)”10  we  can  construct 
two  ergodic  chains  (7^)™  0  and  (Z(*))£L0  which  are  conditionally  independent  given  and  correspond 
to  the  simultaneous  execution  of  two  Chromatic  samplers  (Alg.  4.2). 

Proof.  We  split  the  chain  (lW)™0  =  (X$  ,  xi£ )tLo  over  the  two  colors  and  then  construct  the  chains 
(y(*))£L0  and  (Z(L)  )"l(l  by  simulating  the  two  Chromatic  Gibbs  samplers,  which  each  advance  only  one 
color  at  a  time  conditioned  on  the  other  color  (as  illustrated  in  Figure  4.1): 


Observe  that  no  samples  arc  shared  between  chains,  and  given  X((>>  both  chains  are  independent.  Finally, 
because  both  derived  chains  arc  simulated  from  the  Chromatic  sampler  they  arc  provably  ergodic.  □ 

Using  the  partitioning  induced  by  the  2-coloring  of  the  MRF  we  can  analytically  construct  the  invariant 
distribution  of  the  Synchronous  Gibbs  sampler: 
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Theorem  4.3.3  (Invariant  Distribution  of  Sync.  Gibbs).  Let  (XK] ,  XK2 )  =  X  be  the  partitioning  of  the 
variables  over  the  two  colors,  then  the  invariant  distribution  of  the  Synchronous  Gibbs  sampler  is  the 
product  of  the  marginals  P  (XK  l )  P  (XK2 ). 

Proof  Let  the  current  state  of  the  sampler  be  X  =  (XKl ,  XK2 )  and  the  result  of  a  single  Synchronous  Gibbs 
update  be  X'  =  (X'K  ,  X'K  ).  By  definition  the  Synchronous  Gibbs  update  simulates  X'K  ~  P  {x'K  |  xK2) 
and  X'K2  ~  P  (x'K2  \  xKl ).  Therefore  the  transition  kernel  for  the  Synchronous  Gibbs  sampler  is: 

K(x'  |  x)  =  P  {x'Kl  |  xK2)  P  (x'K2  |  xK1) 

We  can  easily  show  that  P  (XKl)  P  (XK2)  is  the  invariant  distribution  of  the  Synchronous  Gibbs  sampler: 

P  {x  )  =  J2  K(x'  |  x)P  (x) 

*«a)  p  (XK2  I  xKl)  P  (xKl)P  (xK2) 

X'k2)  P  {x'kwXk2) 

XK1  XK2 

=  p«)p(0 


=  SHP« 

Xr,-^  Xk2 

=  SHPKi 


□ 


A  useful  consequence  of  Theorem  4.3.3  is  that  when  computing  ergodic  averages  over  sets  of  variables 
with  the  same  color,  we  can  directly  use  the  non-ergodic  Synchronous  samples  and  still  obtain  convergent 
estimators: 

Corollary  4.3.4  (Monochromatic  Marginal  Ergodicity).  Given  a  sequence  of  samples  (x^)^f0  drawn 
from  the  Synchronous  Gibbs  sampler  on  a  two-colorable  model,  empirical  expectations  computed  with 
respect  to  single  color  marginals  are  ergodic: 


V/,  i  6  1,2:  lim  /(s£?)  E  [f(XKi)\ 


m— >-oo  777, 

t= 0 


Corollary  4.3.4  justifies  many  applications  where  the  Synchronous  Gibbs  sampler  is  used  to  estimate  single 
variables  marginals  and  explains  why  the  Synchronous  Gibbs  sampler  performs  well  in  these  settings. 
However,  Corollary  4.3.4  also  highlights  the  danger  of  computing  empirical  expectations  over  variables 
that  span  both  colors  without  splitting  the  chains  as  shown  in  Theorem  4.3.2. 

We  have  shown  that  both  the  Chromatic  sampler  and  the  Synchronous  sampler  can  provide  ergodic  samples. 
However  the  Chromatic  sampler  is  clearly  superior  when  the  number  of  processors  is  less  than  the  half 
the  number  of  vertices  (p  <  n/2 )  since  it  will  advance  a  single  chain  twice  as  fast  as  the  Synchronous 
sampler. 


4.4  Asynchronous  Gibbs  Sampler 

While  the  Chromatic  sampler  performs  optimally  in  many  applications,  there  are  scenarios  in  which  the 
Chromatic  sampler  can  be  inefficient  or  where  we  may  favor  more  control  over  the  execution  order  at 
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the  expense  of  added  complexity.  If  the  time  required  to  draw  from  conditions  differs  substantially  it  is 
easy  to  construct  a  simple  scenario  where  p  —  1  processors  wait  unnecessarily  even  when  the  optimal 
coloring  is  used.  Alternatively,  we  may  want  to  sample  variables  at  different  frequencies  or  use  adaptive 
prioritization  mechanisms  like  those  proposed  in  Levine  and  Casella  [2006].  In  this  section  we  construct  the 
Asynchronous  Gibbs  Sampler,  which  is  the  parallel  equivalent  of  the  Random  Scan  Gibbs  sampler. 

In  the  Asynchronous  Gibbs  Sampler,  each  processor  randomly  draws  the  next  variable  to  sample  from 
some  (potentially  adaptive  subject  to  the  conditions  in  Levine  and  Casella  [2006])  distribution.  Before  a 
processor  can  sample  a  new  value  for  that  variable,  it  must  first  acquire  the  corresponding  Markov  Blanket 
Lock  using  the  Blanket  Locking  (MBL)  Protocol.  The  Blanket  Locking  protocol  associates  a  Rcad/Writc 
lock  with  each  variable  in  the  model.  The  Markov  blanket  lock  for  variable  Xv  is  obtained  by  acquiring  the 
read-locks  on  all  neighboring  variables  Xjyv  and  the  write  lock  on  Xv.  To  ensure  that  the  Blanket  Locking 
Protocol  is  Dead-lock  free,  locks  are  always  acquired  and  released  using  the  canonical  ordering  of  the 
variables.  The  Markov  blanket  locking  protocol  (MBL)  is  actually  equivalent  to  the  locking  protocol  used 
in  the  LoopyBP  SendMessages  routine  described  in  Alg.  3.5. 

We  show  that  the  Asynchronous  Gibbs  sampler  is  ergodic  and  equivalent  to  the  sequential  execution  of 
the  Random  Scan  Gibbs  sampler.  Ergodicity  relies  on  the  notion  of  Sequential  Consistency.  The  locking 
strategy  employed  by  the  Asynchronous  Gibbs  sampler  ensures  that  there  exists  a  sequential  execution 
which  recovers  the  same  output  as  every  parallel  execution.  Alternatively,  since  we  lock  the  Markov 
Blanket,  we  can  show  that  all  variables  sampled  simultaneously  are  also  conditionally  independent  given 
the  assignment  to  the  remaining  variables. 

One  of  the  major  drawback  to  the  asynchronous  Gibbs  sampler  is  that  processors  can  collide  on  locks, 
resulting  in  sequentialization  and  leading  to  a  loss  in  parallel  performance.  We  can,  none  the  less,  provide 
an  encouraging  lower  bound  on  the  expected  number  of  processors  that  may  operate  simultaneously  (i.e., 
execute  without  colliding  on  a  lock).  If  we  consider  a  small  grid  (d  =  4)  Ising  model  with  n  =  202 
variables  then  using  p  =  8  processors.  Theorem  4.4.1  states  that  in  expectation  we  will  be  able  to  use  at 
least  7  of  the  processors  in  parallel.  The  intuition  behind  the  bound  in  Theorem  4.4.1  is  that  at  least  one 
processor  will  always  run  and  the  remaining  p  —  1  remaining  processors  run  successfully  with  probability 
at  most  (1  —  (p  —  1  )(d  +  1  )/n). 

Theorem  4.4.1  (Randomized  Asynchronous  Parallelism).  If  p  processors  simultaneously  try  to  grab 
variables  uniformly  at  random  from  an  MRF  with  n  random  variables  and  maximum  degree  d  then  the 
expected  number  of  processors  which  will  be  able  to  execute  in  parallel  is  at  least: 

E  [#  Processors  Running  in  Parallel]  >  l  +  (p-l)(l-(p-l)(^))  (4.3) 

Proof  A  processor  is  blocked  if  another  processor  is  running  on  that  vertex  or  any  of  its  neighbors 
(d  +  1  )/n  and  any  of  the  (p  —  1)  other  processors  can  block  a  vertex.  Therefore  a  lower  bound  on  the 
probability  that  one  of  the  (p  —  1)  processors  will  succeed  is  given  by  ( 1  —  (p  —  l)(d  +  1  )/n).  □ 

4.5  The  Splash  Gibbs  Sampler 

The  Chromatic  sampler  provides  a  linear  speedup  for  single-chain  sampling,  advancing  the  Markov  chain 
for  a  /c-colorable  model  in  time  O  +  k)  rather  than  0(n ).  Unfortunately,  some  models  have  strongly 
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Algorithm  4.3:  Parallel  Splash  Sampler 


Input:  Maximum  treewidth  wmax 
Input:  Maximum  Splash  size  hmax 

while  t  <  oo  do 

//  Make  p  bounded  treewidth  Splashes 
{JsJiLi  F-  ParSplash  (wmax,  hmax,  x ^  ) ; 

//  Calibrate  each  junction  trees 
{Jsjf=i  F-  ParCalibrate  {Jsjf=1) ; 

//  Sample  each  Splash 
{xSi}Pi=i  ParSample  ({JsJ?=1)  ; 

/ /  Advance  the  chain 


I  ^Si  > 


(*) 

nr »  <-»  'J' 

yp 


correlated  variables  and  complex  dependencies,  which  can  cause  the  Chromatic  sampler  to  mix  prohibitively 
slowly. 

In  the  single  processor  setting,  a  common  method  to  accelerate  a  slowly  mixing  Gibbs  sampler  is  to 
introduce  blocking  updates  [Barbu  and  Zhu,  2005,  Hamze  and  de  Freitas,  2004,  Jensen  and  Kong,  1996]. 
In  a  blocked  Gibbs  sampler,  blocks  of  strongly  coupled  random  variables  are  sampled  jointly  conditioned 
on  their  combined  Markov  blanket.  The  blocked  Gibbs  sampler  improves  mixing  by  enabling  strongly 
coupled  variables  to  update  jointly  when  individual  conditional  updates  would  force  the  chain  to  move 
through  low-probability  states  in-order  to  mix  or  reach  high-probability  states. 

To  improve  mixing  in  the  parallel  setting  we  introduce  the  Splash  sampler  (Alg.  4.3),  a  general  puipose 
blocking  sampler.  For  each  joint  sample,  the  Splash  sampler  exploits  parallelism  both  to  construct  multiple 
random  blocks,  called  Splashes,  and  to  accelerate  the  joint  sampling  of  each  Splash.  To  ensure  each  Splash 
can  be  safely  and  efficiently  sampled  in  parallel,  we  developed  a  novel  Splash  generation  algorithm  which 
incrementally  builds  multiple  conditionally  independent  bounded  treewidth  junction  trees  for  every  new 
sample.  In  the  initial  rounds  of  sampling,  the  Splash  algorithm  uses  a  simple  adaptation  heuristic  which 
groups  strongly  dependent  variables  together  based  on  the  state  of  the  chain.  Adaptation  is  then  disabled 
after  a  finite  number  of  rounds  to  ensure  ergodicity. 

We  present  the  Splash  sampler  in  three  parts.  First,  we  present  the  parallel  algorithm  used  to  construct 
multiple  conditionally  independent  Splashes.  Next,  we  describe  the  parallel  junction  tree  sampling 
procedure  used  to  jointly  sample  all  variables  in  a  Splash.  Finally,  we  present  our  Splash  adaptation 
heuristic  which  sets  the  priorities  used  during  Splash  generation. 


4.5.1  Parallel  Splash  Generation 

The  Splash  generation  algorithm  (Alg.  4.4)  uses  p  processors  to  incrementally  build  p  disjoint  Splashes 
in  parallel.  Each  processor  grows  a  Splash  rooted  at  a  unique  vertex  in  the  MRF  (Line  1).  To  preserve 
ergodicity  we  require  that  no  two  roots  share  a  common  edge  in  the  MRF,  and  that  every  variable  is  a  root 
infinitely  often. 

Each  Splash  is  grown  incrementally  using  a  best  first  search  (BeFS)  of  the  MRF.  The  exact  order  in 
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(a)  Noisy  (b)  FIFO  Splashes  (c)  FIFO  Splashes  (+)  (d)  Priority  Splashes  (e)  Priority  Splashes  (+) 


Figure  4.2:  Different  Splashes  constructed  on  a  200  x  200  image  denoising  grid  MRF.  (a)  A  noisy  sunset 
image.  Eight  Splashes  of  treewidth  5  were  constructed  using  the  FIFO  (b)  and  priority  (d)  ordering.  Each 
splash  is  shown  in  a  different  shade  of  gray  and  the  black  pixels  are  not  assigned  to  any  Splash.  The 
priorities  were  obtained  using  the  adaptive  heuristic.  In  (c)  and  (e)  we  zoom  in  on  the  Splashes  to  illustrate 
their  structure  and  the  black  pixels  along  the  boundary  needed  to  maintain  conditional  independence. 


which  variables  are  explored  is  determined  by  the  call  to  NextVertexToExplore  ( B )  on  Line  2  of 
Alg.  4.4  which  selects  (and  removes)  the  next  vertex  from  the  boundary  B.  In  Figure  4.2  we  plot  several 
simultaneous  Splashes  constructed  using  a  first-in  first-out  (FIFO)  ordering  (Figure  4.2b)  and  a  prioritized 
ordering  (Figure  4.2d). 

The  Splash  boundary  is  extended  until  there  are  no  remaining  variables  that  can  be  safely  added  or  the 
Splash  is  sufficiently  large.  A  variable  cannot  be  safely  added  to  a  Splash  if  sampling  the  resulting  Splash 
is  excessively  costly  (violates  a  treewidth  bound)  or  if  the  variable  or  any  of  its  neighbors  are  members  of 
other  Splashes  (violates  conditional  independence  of  Splashes). 

There  are  two  core  challenges  to  constructing  multiple  bounded  computation  Splashes  in  parallel.  First,  we 
need  a  fast  incremental  algorithm  to  estimate  the  computational  cost  of  sampling  a  Splash;  and  second,  we 
need  a  deadlock-free,  race-free  protocol  to  ensure  that  simultaneously  constructed  Splashes  are  conditionally 
independent. 

To  bound  the  computational  complexity  of  sampling,  and  to  jointly  sample  the  Splash,  we  rely  on  junction 
trees.  A  junction  tree,  or  clique  graph,  is  an  undirected  acyclic  graphical  representation  of  the  joint 
distribution  over  a  collection  of  random  variables.  For  a  Splash  containing  the  variables  we  construct 
a  junction  tree  ( C ,  E)  =  Js  representing  the  conditional  distribution  P  (X$  \  X-s).  The  set  of  vertices 
C  e  C  is  called  a  clique  and  is  a  subset  of  the  vertices  (i.e.,  C  C  S)  in  the  Splash  S.  The  cliques  satisfy  the 
constraint  that  for  every  factor  domain  A  £  T  there  exists  a  clique  C  €  C  such  that  A  Pi  A  C  C.  The  edges 
E  of  the  junction  tree  satisfy  the  running  intersection  property  (RIP)  which  ensures  that  all  cliques  sharing 
a  common  variable  form  a  connected  bee. 

The  computational  complexity  of  inference,  and  consequently  sampling  using  a  junction  tree,  is  exponential 
in  the  treewidth;  one  less  than  number  of  variables  in  the  largest  clique.  Therefore,  to  evaluate  the 
computational  cost  of  adding  a  new  variable  Xv  to  the  Splash,  we  need  an  efficient  method  to  extend  the 
junction  tree  J5  over  X$  to  a  junction  tree  .1,5 +J,  over  A^yr.j/i  and  evaluate  the  resulting  treewidth. 

To  efficiently  build  incremental  junction  trees,  we  developed  a  novel  junction  tree  extension  algorithm 
(Alg.  4.5)  which  emulates  standard  variable  elimination,  with  variables  being  eliminated  in  the  reverse 
of  the  order  they  are  added  to  the  Splash  (e.g.,  if  X,  is  added  to  J5  then  Xt  is  eliminated  before  all  X$). 
Because  each  Splash  grows  outwards  from  the  root,  the  resulting  elimination  ordering  is  optimal  on  tree 
MRFs  and  typically  performs  well  on  cyclic  MRFs. 
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Algorithm  4.4:  Parallel S plasli :  Parallel  Splash  Generation 
Input:  Maximum  treewidth  »'max 
Input:  Maximum  Splash  size  /tmax 
Output:  Disjoint  Splashes  {Si, . . . ,  Sp} 

do  in  parallel  i  £  {1 

1  r  4—  NextRoot  ( i)  //  Unique  roots 
Sj  4—  {r}  //  Add  r  to  splash 

B  4—  Mr  /  /  Add  neighbors  to  boundary 
Vt-{r}UjVr//  Visited  vertices 
J  5i  4—  JunctionTree  ({r}) 
while  (|Sj|  <  hmax)  A  (|#|  >  0)  do 

2  v  4—  NextVertexToExplore (B) 

MarkovBlanketLock (Xv) 

//  Check  that  v  and  its  neighbors  Mv  are  not  in  other 
Splashes . 

safe  4—  ({n}  u  Mv)  n  S?)  =  0 

4 —  ExtendJunctionTree  (J^,  v) 
if  safe  A  TreeWidth  ( J^i+V )  <  wmax  then 
4— J5i+„ //  Accept  new  tree 
Si  4 —  Si  U  {n} 

3  B 4—  B U  (Mv\V) / /  Extend  boundary 
V  4 —  V  U  Mv  / /  Mark  visited 

MarkovBlanketFree (Xv) 


The  incremental  junction  tree  extension  algorithm  (Alg.  4.5)  begins  by  eliminating  X,  and  forming  the  new 
clique  C,  =  (Mi  n  S)  U  {  {}  which  is  added  to  We  then  attach  C,  to  the  most  recently  added  clique 
CPar,]  that  contains  a  variable  in  C,  (CPau  denotes  the  parent  of  C,;).  The  running  intersection  property 
(RIP)  is  then  restored  by  propagating  the  newly  added  variables  back  up  the  tree.  We  define  the  remaining 
unsatisfied  variables  1Z  =  CA  (A  and  insert  TZ  into  its  parent  clique  CPau.  The  RIP  condition  is  now 
satisfied  for  variables  in  7 Z  which  were  already  in  CPau.  The  parent  for  CPau  is  then  recomputed,  and  any 
unsatisfied  variables  are  propagated  up  the  tree  in  the  same  way.  We  demonstrate  this  algorithm  with  a 
simple  example  in  Figure  4.3. 


To  ensure  that  simultaneously  constructed  Splashes  are  conditionally  independent,  we  employ 
the  Markov  blanket  locking  (MBL)  protocol  from  the  Asynchronous  Gibbs  sampler.  Once  the 
MarkovBlanketLock  (Xv )  has  been  acquired,  no  other  processor  can  assign  Xv  or  any  of  it  neighbors 
Xjifv  to  a  Splash.  Therefore,  we  can  safely  test  if  Xv  or  any  of  its  neighbors  Xyfv  are  currently  assigned 
to  other  Splashes.  Since  we  only  add  Xv  to  the  Splash  if  both  Xv  and  all  its  neighbors  are  currently 
unassigned  to  other  Splashes,  there  will  never  be  an  edge  in  the  MRF  that  connects  two  Splashes. 
Consequently,  simultaneously  constructed  Splashes  are  conditionally  independent  given  all  remaining 
unassigned  variables. 
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Algorithm  4.5:  ExtendJunctionTree  Algorithm 
Input:  The  original  junction  tree  (C,  E )  =  J5. 

Input:  The  variable  X,  to  add  to  J5 
Output:  J,5+j 

Define:  Cu  as  the  clique  created  by  eliminating  u  G  S 

Define:  v[C]  G  S  as  the  variable  eliminated  when  creating  C 

Define:  t[n]  as  the  time  v  G  S  was  added  to  S 

Define:  Pa[n]  G  Mv  n  S  as  the  next  neighbor  of  v  to  be  eliminated. 

C i  <—  {Mi  nS)  u  {i} 

Pa[t]  E-  argmax,ueCA{i}  t[u] 

// - Repair  RIP - 

TZ  <-  C*\  {i}  //  RIP  Set 
t:  t-  Pa[i] 

while  \1Z\  >  0  do 

CV<—CVU7Z//  Add  variables  to  parent 
w  G-  arg maxu,gCj)y{^j  t  [m]  / /  Find  new  parent 
if  w  =  P  a  [n]  then 
|  K^(K\Ci)\{i} 

else 

n  g-  {KuCi)\{i} 

Pa[r]  f-  w//  New  parent 

w  t—  Pa[r]  //  Move  upwards 


4.5.2  Parallel  Splash  Sampling 


Once  we  have  constructed  p  conditionally  independent  Splashes  {<Si}?=1,  we  jointly  sample  each  Splash 
by  drawing  from  P  (Xsf  |  X-s, )  in  parallel.  Sampling  is  accomplished  by  first  calibrating  the  junction  trees 
{ J S,  \\=  \  >  an(i  dicn  running  backward-sampling  stalling  at  the  root  to  jointly  sample  all  the  variables  in  each 
Splash.  We  also  use  the  calibrated  junction  trees  to  construct  Rao-Blackwellized  marginal  estimators.  If  the 
treewidth  or  the  size  of  each  Splash  is  large,  it  may  be  beneficial  to  construct  fewer  Splashes,  and  instead 
assign  multiple  processors  to  accelerate  the  calibration  and  sampling  of  individual  junction  trees. 


To  calibrate  the  junction  tree  we  use  the  ParCalibrate  function  (Alg.  4.6).  The  ParCalibrate 
function  constructs  all  clique  potentials  in  parallel  by  computing  the  products  of  the  assigned  factors 
conditioned  on  the  variables  not  in  the  junction  tree.  Finally,  parallel  belief  propagation  is  used  to  calibrate 
the  tree  by  propagating  messages  in  parallel  following  the  optimal  forward- bac k ward  schedule.  At  each 
level  individual  message  computations  are  done  in  parallel. 


Parallel  backward-sampling  is  accomplished  by  the  function  Par  Sample  which  takes  the  calibrated 
junction  tree  and  draws  a  new  joint  assignment  in  parallel.  The  Par  Sample  function  begins  by  drawing 
a  new  joint  assignment  for  the  root  clique  using  the  calibrated  marginal.  Then,  in  parallel,  each  child  is 
sampled  conditioned  on  the  parent  assignment  and  the  messages  from  the  children. 
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Figure  4.3:  Incremental  Junction  Tree  Example:  The  junction  tree  on  the  top  comprises  the  subset 
of  variables  {1,2, 4, 5, 6}  of  the  MRF  (center).  The  tree  is  formed  by  the  variable  elimination  ordering 
{6, 1,2,5, 4}  (reading  the  underlined  variables  of  the  tree  in  reverse).  To  perform  an  incremental  insertion  of 
variable  3,  we  first  create  the  clique  formed  by  the  elimination  of  3  ({1,2, 3, 6})  and  insert  it  into  the  end 
of  the  tree.  Its  parent  is  set  to  the  latest  occurrence  of  any  of  the  variables  in  the  new  clique.  Next  the  set 
{1,2,6}  is  inserted  into  its  parent  (boldface  variables),  and  its  parent  is  recomputed  in  the  same  way. 


Algorithm  4.6:  ParallelCalibrate:  Junction  Calibration 

Input:  {JsJLi 

do  in  parallel  processor  i  £  {1, . . . ,  p  \ 
forall  Cliques  c  £  .T5,  do  in  parallel 

//  Initialize  clique  potentials 

MXc)  Ufaecfa(Xc,x{Jc) 

•(—  ParallelBPCalibrate  ({V’c}c&s!. ) 

Output:  Calibrated  junction  trees  {Js,:}f=1 


4.5.3  Adaptive  Splash  Generation 


As  discussed  earlier,  the  order  in  which  variables  arc  explored  when  constructing  a  Splash  is  determined 
on  Line  2  in  the  ParSplash  algorithm  (Alg.  4.4).  We  propose  a  simple  adaptive  prioritization  heuristic, 
based  on  the  current  assignment  to  x^\  that  prioritizes  variables  at  the  boundary  of  the  current  tree  which 
are  strongly  coupled  with  variables  already  in  the  Splash.  We  assign  each  variable  Xv  £  B  a  score  using 
the  likelihood  ratio: 


s[Xv] 


Ex 

l/~vrr  _ 

p 

[xs,Xv  =x\X_s  =  x%' 

) 

log 

p 

(4.4) 


and  includes  the  variable  with  the  highest  score.  Effectively,  Eq.  (4.4)  favors  variables  with  greater  average 
log  likelihood  than  conditional  log  likelihood.  We  illustrate  the  consequence  of  applying  this  metric  to 
an  image  denoising  task  in  which  we  denoise  the  synthetic  image  shown  in  Figure  4.2a.  In  Figure  4.4  we 
show  a  collection  of  Splashes  constructed  using  the  score  function  Eq.  (4.4).  To  see  how  priorities  evolve 
over  time,  we  plot  the  update  frequencies  early  (Figure  4.4a)  and  later  (Figure  4.4b)  in  the  execution  of  the 
Splash  scheduler. 
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Figure  4.4:  The  update  frequencies  of  each  variable  in  the  200  x  200  image  denoising  grid  MRF  for  the 
synthetic  noisy  image  shown  in  Figure  4.2a.  The  brighter  pixels  have  been  prioritized  higher  and  are 
therefore  updated  more  frequently,  (a)  The  early  update  counts  are  relatively  uniform  as  the  adaptive 
heuristic  has  not  converged  on  the  priorities,  (b)  The  final  update  counts  arc  focused  on  the  boundaries  of 
the  regions  in  the  model  corresponding  to  pixels  that  can  be  most  readily  changed  by  blocked  steps. 


4.5.4  A  Need  for  Vanishing  Adaptation 


Given  the  flexibility  in  scheduling  Gernan  and  Genian  [1984]  ascribed  to  the  Gibbs  sampler,  it  would  seem 
reasonable  to  continually  adapt  the  update  schedule  and  Splash  shape  especially  since  any  individual  Splash 
preserves  the  underlying  distribution.  Indeed,  Levine  and  Casella  [2006]  make  a  similar  claim  about  their 
Algorithm  7.1:  which  is  a  generalization  of  adaptive  Gibbs  samplers.  Unfortunately,  we  can  demonstrate, 
by  means  of  a  simple  counter  example,  that  tuning  the  Splash  (blocking)  in  general  breaks  ergodicity  even 
when  we  include  all  variables  with  positive  probability. 

Consider  the  case  of  two  uniform  independent  binary  variables  X\  and  X2  with  constant  probability  mass 
function  P  (xi,x2)  oc  1.  We  define  an  adaptive  sampling  procedure  which  picks  a  variable  to  sample 
uniformly  at  random  if  both  variables  have  the  same  assignment  (xq  =  x2).  However,  if  the  variables  have 
different  assignments  (xq  %  X2),  we  pick  the  variable  with  assignment  1  with  90%  probability.  Let  the 
choice  of  variable  to  sample  (Splash)  be  represented  by  P  (S  X\ ,  X2)  where  S  C  {1,2}.  Let  (Xl,X2) 
be  the  current  state  and  (X(,  X2)  be  the  new  state  after  one  step  of  the  procedure.  Then  the  transition 
kernel  is  given  by  (where  1  [•]  is  the  indicator  function): 


P  {X'i,X'2  |  S  =  {1}  ,  Xi,X2)  oc  1  [x2  =  x2] 
P(X{,X'|5  =  {2},X1,X2)  oc  l[x}  =  xi] 


Marginalizing  the  assignments  to  (Xi,X2)  with  respect  to  the  true  probability  P  (Aq,  X2)  we  ob- 
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tain: 


P(X[=x'1,X/2  =  x'2)  =  EE  P(S|Xi,X2)P(X(,X'  \S,X1,X2)P(X1,X2) 


XI,  X2  S 


EE  P  {S  I  X1,X2)  P  (X[,  X'  I  5,  X1,X2) 


Xl,X2  S 


^  P  (cS  =  {1}  I  X1;  X2)  1  [x'2  =  X2]  +  p  (5  =  {2}  I  X1;  X2)  1  K  =  si; 


^P(5  =  {1}  \Xl,X2  =  x'2)  +  ^P(5  =  {2}  |  X\  =  x[,  X2) 


=  1  [x2  =  l]  (0.5  +  0.1)  +  1  [x'2  =  0]  (0.9  +  0.1) 

+  1  [x[  =  l]  (0.5  +  0.1)  +  1  [si  =  0]  (0.9  +  0.1), 


where  the  last  step  of  the  process  is  computing  by  substituting  in  the  values  of  P  (<S  |  X\,X2)  as  defined  by 
the  adaptation  process. 

The  resulting  distribution  of  P  (X[,X2)  is 


which  clearly  is  not  the  correct  stationary  distribution  and  the  Gibbs  sampler  is  therefore  not  ergodic.  This 
counter  example  refutes  the  proof  of  Algorithm  7. 1  in  Levine  and  Casella  [2006],  and  places  conditions 
upon  the  Gibbs  sampler  described  in  [Geman  and  Gernan,  1984].  Specifically,  simply  sampling  from  every 
variable  infinitely  often  is  insufficient  for  ergodicity. 

To  ensure  the  chain  remains  ergodic,  we  disable  prioritized  tree  growth  after  a  finite  number  of  iterations, 
and  replace  it  with  a  random  tree  growth.  This  form  of  vanishing  adaptation  helps  move  to  high- 
likelihood  samples  and  can  accelerate  early  mixing  (burn-in).  However,  ultimately  we  would  like  to  develop 
adaptation  heuristics  that  do  not  require  vanishing  adaptation  and  believe  this  is  an  important  area  of  future 
research. 

By  employing  vanishing  adaptation  we  show  that  our  Splash  sampler  with  vanishing  is  ergodic: 

Theorem  4.5.1  (Splash  Sampler  Ergodicity).  The  adaptive  Splash  sampler  with  vanishing  adaptation  is 
ergodic  and  converges  to  the  true  distribution. 

Proof.  We  prove  Theorem  4.5.1  in  three  parts.  First  we  show  that  P  (X)  is  the  invariant  distribution  of  the 
Splash  Sampler: 

P  (X(t+1))  =  K  (-E(m)  I  Xw  =  x)  P  (Xw  =  x)  (4.5) 

X 

Second,  we  show  that  the  Splash  sampler  forgets  its  stalling  state: 

sup  P  (Xw  =  x  |  X(0)  =yj-  P  (Xw  =  x  |  X(0)  =  z)  x  <  7t/n  (4.6) 
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(4.7) 


Finally,  we  show  that  the  sampler  draws  from  P  (X)  in  the  limit: 

lim  sup  P  (x®  =x\X{®  =y)  -  P  (X  =  x)  =0 

t—t  OO  y^x  V  /I 

Distributional  Invariance:  Because  vanishing  adaptation  is  used  we  show  that  P  (X)  is  preserved  for 
Splash  updates  without  adaptation:  where  the  choice  of  Splash  S  0)  does  not  depend  on  the  state  of  the  chain. 
However,  we  do  allow  the  choice  of  the  Splash  to  depend  on  all  previous  Splash  choices  {5l  0h  . . . ,  S(L>  }. 
The  transition  kernel  for  the  Splash  sampler  given  the  Splash  S  is  defined  as: 


K  (X(t+1)  |  x(t))  =  X]  P  (5  I  S(0) '  ■ '  s(t~1))  1  X-s 1)  =  X-S  P  (Xs+1)  I  X-S 1))  (4-8) 

Substituting  Eq.  (4.8)  into  Eq.  (4.5)  we  obtain: 


(x(i+1)  I  P 

(4.9) 

^^p(s|5(0)...Sw 

x(‘)  5 

)  1  a£j1}  =  x%  P  (4+1)  |  X{Y1])  P  (x(t)) 

(4.10) 

Y  P  (5  |  S(0) . . 

P 

(4+1)  |  X41})  Y  1  [*S1}  =  x%]  P  (jfW) 

(4.11) 

^  P  («S  |  s(0) . . 

<5 

,5(t))  P 

(x^+1)  1  x(Ls  L))  P  (X-51}) 

(4.12) 

P  (S  |  s(0) . . 
s 

,5(t))  P 

(4.13) 

p  (x(m)) 

(4.14) 

Dependence  on  Starting  State:  We  now  show  that  the  Splash  sampler  forgets  its  initial  starting  state.  This 
is  done  by  first  showing  that  there  is  a  positive  probability  of  reaching  any  state  after  a  bounded  number  of 
Splash  operations.  Then  we  use  this  strong  notion  of  irreducibility  to  setup  a  recurrence  which  bounds  the 
dependence  on  the  starting  state. 

The  Splash  algorithm  sweeps  across  the  roots  ensuring  that  after  At  <  n  tree  updates  all  variables  are 
sampled  at  least  once.  Define  the  time  tt  as  the  last  time  variable  X,  was  updated.  Without  loss  of  generality 
(by  renumbering  the  variables)  we  assume  that  t\  <  to  <  . . .  <  tn.  We  can  then  bound  the  probability  that 
after  n  Splashes  we  reach  state  x  given  we  were  initially  at  state  y, 

n 

P  fl(")  =  x  I  x(°)  =  y)  >  TT  inf  P  (Xi  =  Xi  |  =  x-i) .  (4.15) 

\  /  X  —  i 

2—1 

Define  the  smallest  conditional  probability, 

5  =  mfP(Xi  =  xi\X-i  =  x_i),  (4.16) 

i,x 

then  we  can  lower  bound  the  probability  of  reach  any  state: 

P  (X(n)  =  x  |  X(0)  =yj>  Sn.  (4.17) 
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Effectively,  we  are  stating  that  P  (X)  is  irreducible  since  all  the  conditionals  are  positive.  As  a  consequence 
we  can  show  that  there  exists  an  7  such  that  0  <  7  <  1  and 


sup 

x,y,z 


=  x\X(0)  =  y')  -  P  (x(t)  =  x\X(0]  =  z')  <7*/n.  (4.18) 

This  is  trivially  true  for  t  =  0.  We  can  rewrite  the  left  side  of  the  above  equation  as 


sup 

x,y,z 


=  X  |  X(0)  =y]-p(  X®  =  X  |  X(0)  = 


sup 


sup P  (Xw  =  X  I  X(0)  =yj-  inf  P  (Xw  =  x  |  X(0)  =  * 


(4.19) 


Now  we  will  bound  the  inner  sup  and  inf  terms.  For  t  >  n  we  can  introduce  a  probability  measure  /t  such 
that 


supP 

y 


=  x  |  XW  =  y) 

=  sup  Y  p  ( x(t)  =  X  |  x(n)  =  P  f X ^  =  w  I  =  y 


<  sup  V  P  (x^  =  X  I  iW  =  '(/;)  n(w) 
v  j 


(4.20) 

(4.21) 


Then  by  defining  the  w*  as 


w  =  arg  sup  J 


P  ( X{t)  =  x  I  X(n)  = 


w  , 


(4.22) 


we  can  easily  construct  the  maximizing  //  which  places  the  minimal  mass  5n  on  all  w  except  w*  and  places 
the  remaining  mass  1  —  (|fl|  —  1)4”  on  w*.  This  leads  to 


sup  P 


=  x\X(®  =y')  <  (1  -  (|fi|  -  l)^n)P  (x®  =x\X^  =W*J  + 


sn  y  p  =  x  1 x(n)  = 

Wy^W* 


w 


(4.23) 

(4.24) 


Similarly  we  define  w*  as  the  minimizing  element  in  <> 

=  arg  inf  P  (x^  =  x  \  X ^  =  w 
W  V 

We  can  then  construct  the  lower  bound, 

infP  (x(t)  =  x\Xw  =  z')  >  (1  -  (|fi|  -  1)<T)P  (x®  =x\X^n)  = 

sn  y  p  (x(t)  = x  1 x(n)  =  w 

Wy^W* 


(4.25) 


w, 


(4.26) 

(4.27) 
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Taking  the  difference,  we  get 

P  (  X™  =  x  |  A(0)  =y)  -  P  (X®  =  x  |  X(0)  =  z 


sup 

x,y,z 


=  sup 

x 


sup . 

y 


P  (X®  =  x  |  X(0)  =  yj  -  inf  P  (\ X (t)  =  x  \  A(0)  =  z  | 

<  sup  II  (1  -  (|fi|  -  1)<5")  P  (x®  =  X  I  X(n^  =  w*)  +  5n  Y  p  (^(<)  =  *  I  x(n)  =  w) 


Wy^W* 


(1  -  (|fi|  -  1  )6n)  P  (x®  =  x  |  XW  =  m*)  -  6n  Y  P  (x(t)  =  x  I  x(n)  = 

w^w* 


We  can  bound  the  sum  terms 

5n  Y  P  (-*(*)  =  *  |  X(n)  =  w)-8n  Y  P  TO)  =  x  I  X(n )  =  w)  < 

W^W*  W^W* 

5n  Y  p  TO)  =  *  |  A(n)  =w)-SnY  p  TO)  =  *  I  X(n)  =  w)  =  0, 


and  then  simplify  Eq.  (4.29)  to  obtain 

sup  P  (x®  =  x  |  A(0)  =  y]  -  P  ( X W  =  x  |  X(0)  =  z 

x,y,z  ' 


< 


(1  -  (|fi|  -  l)<5n)  sup  P  (TO  =  x  |  I(n)  =  y)  -  P  (TO  =  x  |  XW  = 

x,y,z 


We  can  repeat  this  procedure  f/n  times  to  the  term  |  |P  (jW  =  x  |  ATO  =  y)  —  P  (A'W  =  x\X^ 
on  the  right  side  of  the  above  equation  to  obtain  the  desired  result. 


sup 

x,y  ,z 


=  x  I  A(°)  =yj-  P  =  x  1 1(0)  =  z)  <  (1  -  (|fi|  -  1  )Sn)t/n 


We  can  use  Eq.  (4.31)  along  with  7 r  invariance  to  finish  the  proof, 
lim  sup  P  (x®  =  y  \  X^  =  x)  —  P  (y) 

t - ^OO  y  x  V  / 


=  lim  sup 


t — yoo 


y,x 


=  lim  sup 

t— >00  yjX 


=  lim  sup 


t—>  OO 


y,x 


P  (jW  =  y  I  TO)  =  x)  -  Y  P  (2/ 1  *)  P  0 

Z 

p  JxW  =  2/ 1  TO)  =  x)  -  Y  P  (xW  =  V  I  x(0)  =  p  (z) 

Z 

Y  p  (*)  (p  (TO  =  y  I  TO)  =  x)  -  P  (iW  =  y  1 1(°)  =  z 


<  lim  sup  P  ( Xw  =  y  |  TO)  =  x)  -  P  ( =  y  |  TO}  = 


t — ^OO 


y,xtz  1 


<  lim  (1  -  (|fi|  -  l)(5n)t/n 

t— >  OO 

=  0. 


(4.28) 

u;)  Hi- 

(4.29) 


(4.30) 

TOIIr 

(4.31) 

(4.32) 

(4.33) 

(4.34) 

(4.35) 

(4.36) 

(4.37) 

□ 
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Figure  4.5:  Comparison  of  Chromatic  sampler  and  the  Splash  sampler  at  different  settings  (i.e., 
Chromatic  (p),  Splash  (p,  iumax,  adaptation)  for  p  processors  and  treewidth  u>max)  on  the  synthetic  im¬ 
age  denoising  grid  model  and  the  Cora  Markov  logic  network.  Adaptation  was  not  used  in  the  CORA- 1 
MLN.  (a,e)  The  un-normalized  log-likelihood  plotted  as  a  function  of  running-time.  (b,f)  The  variance  in 
the  estimator  of  the  expected  assignment  computed  across  10  independent  chains  with  random  stalling 
points.  (c,g)  The  total  number  of  variables  sampled  in  a  20  second  window  plotted  as  a  function  of  the 
number  of  cores.  (d,h)  The  speedup  in  number  of  samples  drawn  as  a  function  of  the  number  of  processors. 


4.6  Experiments 


We  implemented  an  optimized  C++  version  of  both  the  Chromatic  and  Splash  samplers  for  arbitrary 
discrete  factorized  models.  Although  Alg.  4.3  is  presented  as  a  sequence  of  synchronous  parallel  steps, 
our  implementation  splits  these  steps  over  separate  processors  to  maximize  performance  and  eliminate 
the  need  for  threads  to  join  between  phases.  While  the  calibration  and  sampling  of  individual  junction 
trees  can  also  be  parallelized,  we  found  that  for  the  typically  small  treewidth  used  in  our  experiments, 
the  overhead  associated  the  additional  parallelism  overrode  any  gains.  Nonetheless,  when  we  made  the 
treewidth  sufficiently  large  (e.g.,  106  sized  factors)  we  were  able  to  obtain  13x-speedup  injunction  tree 
calibration  on  32  cores. 

To  evaluate  the  proposed  algorithms  in  both  the  weakly  and  strongly  correlated  setting  we  selected  two 
representative  large-scale  models.  In  the  weakly  correlated  setting  we  used  a  40, 000  variable  200  x  200 
grid  MRF  similar  to  those  used  in  image  processing.  The  latent  pixel  values  were  discretized  into  5  states. 
Gibbs  sampling  was  used  to  compute  the  expected  pixel  assignments  for  the  synthetic  noisy  image  shown 
in  Figure  4.2a.  We  used  Gaussian  node  potentials  centered  around  the  pixel  observations  with  a2  =  1 
and  Ising-Potts  edge  potentials  of  the  form  exp(— 35 (xi  /  Xj)).  To  test  the  algorithms  in  the  strongly 
correlated  setting,  we  used  the  CORA-1  Markov  Logic  Network  (MLN)  obtained  from  Domingos  [2009]. 
This  large  real-world  factorized  model  consists  of  over  10, 000  variables  and  28, 000  factors,  and  has  a 
much  higher  connectivity  and  higher  order  factors  than  the  pairwise  MRF. 
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In  Figure  4.5  we  present  the  results  of  running  both  algorithms  on  both  models  using  a  state-of-the-art  32 
core  Intel  Nehalem  (X7560)  server  with  hyper-threading  disabled.  We  plot  un-normalized  log-likelihood 
and  across  chain  variance  in-terms  of  wall-clock  time.  In  Figure  4.5a  and  Figure  4.5e  we  plot  the  un¬ 
normalized  log-likelihood  of  the  last  sample  as  a  function  of  time.  While  in  both  cases  the  Splash  algorithm 
out-performs  the  chromatic  sampler,  the  difference  is  more  visible  in  the  CORA-1  MLN.  We  found  that  the 
adaptation  heuristic  had  little  effect  in  likelihood  maximization  on  the  CORA- 1  MLN,  but  did  improve 
performance  on  the  denoising  model  by  focusing  the  Splashes  on  the  higher  variance  regions.  In  Figure  4.5b 
and  Figure  4.5f  we  plot  the  variance  in  the  expected  variable  assignments  across  10  independent  chains 
with  random  stalling  points.  Here  we  observe  that  for  the  faster  mixing  denoising  model,  the  increased 
sampling  rate  of  the  Chromatic  sampler  leads  to  a  greater  reduction  in  variance  while  in  the  slowly  mixing 
CORA-1  MLN  only  the  Splash  sampler  is  able  to  reduce  the  variance. 

To  illustrate  the  parallel  scaling  we  plot  the  number  of  samples  generated  in  a  20  seconds  (Figure  4.5c 
and  Figure  4.5d)  as  well  as  the  speedup  in  sample  generation  (Figure  4.5c  and  Figure  4.5d).  The  speedup 
is  computed  by  measuring  the  multiple  of  the  number  of  samples  generated  in  20  seconds  using  a  single 
processor.  The  ideal  speedup  is  linear  with  32  x  speedup  on  32  cores. 

We  find  that  the  Chromatic  sampler  typically  generates  an  order  of  magnitude  more  samples  per  second 
than  the  more  costly  Splash  sampler.  However,  if  we  examine  speedup  curves  we  see  that  the  larger 
cost  associated  with  the  Splash  construction  and  inference  contributes  to  more  exploitable  coarse  grain 
parallelism.  Interestingly,  in  Figure  4.5h  we  see  that  the  Splash  sampler  exceeds  the  ideal  scaling.  This  is 
actually  a  consequence  of  the  high  connectivity  forcing  each  of  the  parallel  Splashes  to  be  smaller  as  the 
number  of  processors  increases.  As  a  consequence  the  cost  of  computing  each  Splash  is  reduced  and  the 
sampling  rate  increases.  However,  this  also  reduces  some  of  the  benefit  from  the  Splash  procedure  as  the 
size  of  each  Splash  is  smaller  resulting  a  potential  increase  in  mixing  time. 


4.7  Additional  Related  Work 


As  discussed  in  the  motivation,  the  majority  of  existing  large-scale  sampling  efforts  have  not  preserved 
ergodicity  and  in  many  cases  focused  on  largely  synchronous  updates.  However,  some  more  recent  sampling 
efforts  have  taken  a  strongly  asynchronous  approach.  For  example,  Smola  and  Narayanamurthy  [2010] 
constructed  one  of  the  largest  collapsed  Gibbs  samplers  for  LDA.  This  approach  is  entirely  asynchronous 
with  each  variable  being  sampled  as  fast  as  possible.  While  this  approach  has  not  yet  been  shown  to 
converge  to  the  correct  distribution,  it  seems  to  perform  extremely  well  in  practice. 

Later  work  by  Ahmed  et  al.  [2012]  improves  upon  the  work  by  Smola  and  Narayanamurthy  [2010], 
and  introduces  the  parameter  server  a  powerful  new  primitive  for  expressing  highly  asynchronous  but 
unstructured  computation.  The  parameter  server  is  a  large  distributed  key-value  store  which  supports 
eventually  consistent  [Vogels,  2009]  transactions  using  a  novel  delta  update  strategy.  Each  processor  then 
sequentially  updates  its  local  variables  writing  their  new  values  to  the  parameter  server.  As  a  consequence 
the  updates  are  locally  consistent  but  globally  inconsistent.  The  system  them  replicates  these  changes  to  the 
remaining  processors  as  quickly  as  possible.  While  this  work  does  not  leverage  the  graphical  dependency 
structure,  the  novel  use  of  asynchrony  is  consistent  with  the  GrAD  methodology. 

In  the  single  processor  setting  there  has  been  a  lot  of  work  on  blocking  Gibbs  samplers  [Barbu  and  Zhu, 
2005,  Hamze  and  de  Freitas,  2004,  Jensen  and  Kong,  1996].  Our  Splash  algorithm  is  inspired  by  the 
work  of  Hamze  and  de  Freitas  [2004]  which  studies  the  case  of  constructing  a  fixed  set  of  tree-width  one 
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spanning  trees.  They  provide  deeper  analysis  of  the  potential  acceleration  in  mixing  achieved  by  tree  based 
blocking  with  Rao-Blackwellization. 


4.8  Conclusion 

In  this  chapter  we  used  the  GrAD  design  methodology  to  develop  three  ergodic  parallel  Gibbs  samplers  for 
high-dimensional  models:  the  Chromatic  sampler,  the  Asynchronous  Sampler,  and  the  Splash  sampler.  The 
Chromatic  parallel  Gibbs  sampler  can  be  applied  where  single  variable  updates  still  mix  well,  and  uses 
graph  coloring  techniques  to  schedule  conditionally  independent  updates  in  parallel.  We  related  Chromatic 
sampler  to  the  commonly  used  (but  non-ergodic)  Synchronous  Gibbs  sampler,  and  showed  that  we  can 
recover  two  ergodic  chains  from  a  single  non-ergodic  Synchronous  Gibbs  chain. 

We  relaxed  the  strict  chromatic  schedule  to  introduce  the  Asynchronous  Gibbs  sampler.  The  Asynchronous 
Gibbs  sampler  relies  on  the  Markov  Blanket  Locking  protocol  to  ensure  ergodicity  and  actually  a  special 
case  of  the  Splash  Gibbs  sampler  with  Splash  size  set  to  one. 

In  settings  with  tightly  couples  variables,  the  parallelism  afforded  by  the  Chromatic  Gibbs  sampler 
may  be  insufficient  to  achieve  rapid  mixing.  We  therefore  proposed  the  Splash  Gibbs  sampler  which 
incrementally  constructs  multiple  conditionally  independent  bounded  treewidth  blocks  (Splashes)  in 
parallel.  We  developed  a  novel  incremental  junction  tree  construction  algorithm  which  quickly  and 
efficiently  updates  the  junction  tree  as  new  variables  are  added.  We  further  proposed  a  procedure  to 
accelerate  bum-in  by  explicitly  grouping  strongly  dependent  variables,.  An  interesting  topic  for  future 
work  is  whether  one  can  design  an  adaptive  parallel  sampler  (i.e.,  one  that  continually  modifies  its  behavior 
depending  on  the  current  state  of  the  chain)  that  still  retains  ergodicity. 
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Chapter  5 


GraphLab:  Generalizing  the  GrAD 
Methodology 


In  Chapter  3  and  Chapter  4  we  applied  the  GrAD  methodology  to  design  and  implement  efficient  parallel 
and  distributed  algorithms  for  probabilistic  reasoning.  In  this  chapter  we  generalize  the  thesis  approach  by 
introducing  GraphLab,  a  high-level  graph-parallel  framework  for  the  design  and  implementation  algorithms 
that  adhere  to  the  GrAD  methodology. 


5.1  Introduction 


With  the  exponential  growth  in  the  scale  and  complexity  of  structured  probabilistic  reasoning  in  Machine 
Learning  and  Data  Mining  (MLDM)  applications  there  is  an  increasing  need  for  systems  that  can  execute 
MLDM  algorithms  efficiently  on  multicore  machines  and  distributed  environments.  Simultaneously,  the 
availability  of  cloud  computing  services  like  Amazon  EC2  provide  the  promise  of  on-demand  access 
to  affordable  large-scale  computing  and  storage  resources  without  substantial  upfront  investments.  In 
the  previous  chapters  we  presented  the  design  and  implementation  two  MLDM  systems  for  probabilistic 
reasoning.  In  both  cases  we  had  to  explicitly  address  the  design  of  low-level  primitives  for  scheduling, 
mutual  exclusion,  data-placement,  and  distributed  coordination  in  addition  to  the  design  of  efficient 
algorithms. 

Unfortunately,  designing,  implementing,  and  debugging  the  efficient  parallel  and  distributed  MLDM 
algorithms  needed  to  fully  utilize  multicore  and  distributed  systems  can  be  extremely  challenging  requiring 
MLDM  experts  to  address  race  conditions,  deadlocks,  distributed  state,  and  communication  protocols  while 
simultaneously  developing  mathematically  complex  models  and  algorithms.  While  low  level  abstractions 
like  MPI  and  pthreads  provide  powerful,  expressive  primitives,  they  force  the  user  to  address  hardware 
issues  and  the  challenges  of  parallel  and  distributed  system  design. 

To  avoid  the  challenges  of  system  design,  many  have  turned  to  high-level  data-parallel  abstractions,  which 
dramatically  simplify  the  design  and  implementation  of  a  restricted  class  of  parallel  algorithms.  For 
example,  the  MapReduce  abstraction  [Dean  and  Ghemawat,  2004]  has  been  successfully  applied  to  a 
broad  range  of  ML  applications  [Chu  et  al.,  2006,  Panda  et  al.,  2009,  Wolfe  et  al.,  2008,  Ye  et  al.,  2009]. 
The  success  of  these  abstractions  derives  from  their  ability  to  restrict  the  types  of  algorithms  that  may 
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be  expressed  enabling  the  isolation  of  the  algorithm  from  the  systems  specific  challenges  and  forcing  the 
algorithm  designer  to  explicitly  expose  a  high  degree  of  parallelism.  However,  by  restricting  our  focus  to 
ML  algorithms  that  are  naturally  expressed  in  existing  data-parallel  abstractions,  we  are  often  forced  to 
make  overly  simplifying  assumptions.  Alternatively,  as  was  demonstrated  in  Chapter  3  by  coercing  efficient 
sequential  ML  algorithms  to  satisfy  the  restrictions  imposed  by  many  existing  abstraction,  we  often  produce 
inefficient  parallel  algorithms  that  require  many  processors  to  be  competitive  with  comparable  sequential 
methods. 

As  we  will  demonstrate  in  this  chapter,  the  efficient  MLDM  algorithms  that  follow  the  GrAD  methodology 
are  not  naturally  expressed  by  existing  high-level  data-parallel  abstractions.  As  a  consequence  many  [Asun¬ 
cion  et  al.,  2008,  Graf  et  al.,  2004,  Nallapati  et  al.,  2007,  Newman  et  al.,  2007,  Smola  and  Narayanamurthy, 
2010]  recent  large-scale  learning  systems  arc  built  upon  low-level  tools  and  targeted  at  individual  models 
and  applications.  This  time  consuming  and  often  redundant  effort  slows  the  progress  of  the  field  as  different 
research  groups  repeatedly  solve  the  same  parallel  and  distributed  computing  problems.  Therefore,  the 
MLDM  community  needs  a  high-level  distributed  abstraction  that  specifically  targets  the  asynchronous, 
dynamic,  graph-parallel  computation  found  in  many  MLDM  applications  while  hiding  the  complexities  of 
parallel  and  distributed  system  design. 

In  this  chapter  we  introduce  GraphLab,  a  new  graph-parallel  abstraction  that  explicitly  targets  MLDM 
algorithms  which  follow  the  GrAD  methodology.  GraphLab  enables  MLDM  experts  to  easily  design  and 
implement  efficient  scalable  parallel  and  distributed  MLDM  algorithms  by  composing  problem  specific 
computation,  data-dependencies,  and  scheduling. 

We  begin  by  first  providing  some  background  on  existing  abstractions  and  their  limitations  in  the  context  of 
the  GrAD  methodology.  We  then  present  the  Graph-Parallel  setting  and  introduce  the  GraphLab  abstraction. 
We  describe  how  the  abstraction  can  be  used  to  efficiently  express  a  range  of  MLDM  algorithms  including 
the  earlier  Belief  Propagation  and  Gibbs  sampling  algorithms.  We  then  describe  an  efficient  multicore 
implementation  of  the  GraphLab  abstraction  and  evaluate  our  implementation  by  using  it  to  run  several 
GrAD  algorithms  including  belief  propagation  and  Gibbs  sampling  as  well  as  several  new  algorithms 
designed  using  the  GraphLab  abstraction  and  the  GrAD  methodology.  We  then  extend  our  multicore 
implementation  to  the  distributed  (cloud)  setting  and  discuss  several  methods  used  to  achieve  serializability 
and  fault-tolerance.  Finally  we  evaluate  the  distributed  implementation  on  an  EC2  cluster  demonstrating 
order  of  magnitude  gains  of  published  results. 

The  main  contributions  of  this  chapter  are: 

•  A  summary  of  common  properties  of  MLDM  algorithms  (Section  5.2)  and  the  limitations  of  existing 
abstractions  frameworks  (Section  5.3). 

•  The  GraphLab  abstraction  (Section  5.4)  which  targets  algorithms  designed  using  the  GrAD  method¬ 
ology  and  is  composed  of: 

1.  A  data  graph  (Section  5.4. 1)  which  simultaneously  represents  data  and  computational  depen¬ 
dencies. 

2.  A  vertex  program  (Section  5.4.2)  representation  of  user  defined  computation  which  isolates 
the  algorithm  from  the  movement  of  data  and  enables  dynamic  scheduling. 

3.  A  simple  serial  execution  model  (Section  5.4.3)  which  exposes  a  high  degree  of  asynchronous 
parallelism. 
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(a)  Async  vs  Sync  PageRank  (b )  Dynamic  PageRank 


(c)  LoopyBP  Conv. 


\ Not  Serializable 


Updates 


(d)  ALS  Serializability 


Figure  5.1:  (a)  Rate  of  convergence,  measured  in  L\  error  to  the  true  PageRank  vector  versus  time,  of  the 
PageRank  algorithm  on  a  25M  vertex  355M  edge  web  graph  on  16  processors,  (b)  The  distribution  of  update 
counts  after  running  dynamic  PageRank  to  convergence.  Notice  that  the  majority  of  the  vertices  converged 
in  only  a  single  update,  (c)  Rate  of  convergence  of  Loopy  Belief  propagation  on  web-spam  detection, 
(d)  Comparing  serializable  and  non-serializable  (racing)  execution  of  the  dynamic  ALS  algorithm  in 
Section  5.8.1  on  the  Netflix  movie  recommendation  problem.  Non-serializable  execution  exhibits  unstable 
convergence  behavior. 


4.  A  range  of  schedulers  (Section  5.4.4)  which  enable  rich  dynamic  computation. 

5.  Strong  serializability  guarantees  (Section  5.4.5)  which  simplify  design,  analysis,  and  imple¬ 
mentation  of  GrAD  algorithms  and  preserves  the  simple  serial  execution  model. 

6.  A  concurrent  aggregation  framework  (Section  5.4.6)  which  allows  users  to  incrementally 
estimate  global  aggregates  of  the  graph. 

•  An  optimized  multicore  implementation  (Section  5.5)  of  the  GraphLab  abstraction  which  extends 
many  of  the  system  design  techniques  developed  in  Chapter  3  and  Chapter  4. 

•  An  optimized  distributed  implementation  (Section  5.7)  of  GraphLab  that  address  distributed  data 
graph  representation,  mutual  exclusion,  and  fault  tolerance. 

•  GraphLab  implementations  of  a  wide  range  of  GrAD  algorithms  including  parameter  learning  and 
inference  in  graphical  models,  Gibbs  sampling  as  well  as  several  new  algorithms  including  CoEM, 
Lasso,  compressed  sensing,  image  segmentation,  and  collaborative  filtering. 

•  An  experimental  evaluation  on  a  range  of  real-world  problems  using  both  the  multicore  (Section  5.6) 
and  distributed  (Section  5.8)  implementations  demonstrating  substantial  performance  gains  over 
existing  systems. 


5.2  Common  Patterns  in  MLDM 


The  GraphLab  abstraction  explicitly  targets  algorithms  that  follow  the  GrAD  methodology  and  effectively 
generalizes  the  approach  described  in  this  thesis.  In  this  section  review  the  key  properties  of  the  GrAD 
methodology  and  discuss  how  they  generalize  to  a  wide  range  MLDM  algorithms. 
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5.2.1  Sparse  Graph  Structured  Dependencies 


Many  of  the  recent  advances  in  MLDM  have  focused  on  modeling  the  dependencies  between  data.  By 
modeling  data  dependencies,  we  are  able  to  extract  stronger  signal  from  noisy  data.  For  example,  modeling 
the  dependencies  between  similar  shoppers  and  products  allows  us  to  make  better  product  recommendations 
than  treating  shoppers  in  isolation.  However,  directly  modeling  all  possible  relationships  between  dependent 
data  is  both  computationally  intractable  and  statistically  inefficient,  requiring  substantial  amounts  of  data 
to  resolve  the  form  of  each  relationship.  As  a  consequence,  efficient  MLDM  techniques  typically  reduce 
these  dependencies  to  a  sparse  graphical  structure.  For  example,  in  the  earlier  chapters  we  discussed  how 
statistical  dependencies  can  be  encoded  in  the  form  of  sparse  probabilistic  graphical  models. 

In  this  chapter  we  will  consider  the  more  general  case  in  which  dependencies  between  model  parameters 
and  data  are  encoded  in  the  form  of  a  graph.  Such  graphs  include  social  networks  which  relate  the  interests 
of  individuals,  term-document  co-occurence  graphs  which  connect  documents  with  the  key  terms  they 
contain,  and  even  physical  networks  such  as  those  used  to  represent  protein  structures  and  interactions. 
Identifying  and  exploiting  problem  specific  sparse  structure  is  a  common  pattern  in  MLDM  research  and 
we  will  assume  that  this  graph-structure  is  known  in  advance.  Alternatively,  techniques  for  resolving  the 
dependency  structure  typically  work  by  extending  upon  existing  structures. 

As  we  saw  in  the  earlier  chapters  the  algorithms  for  reasoning  about  these  dependences  (e.g.,  Belief 
Propagation  and  Gibbs  sampling)  operate  on  the  local  neighborhood  of  each  vertex  transforming  only 
a  small  context  of  the  larger  graph.  For  example,  the  conditional  distribution  of  each  random  variable 
in  a  large  statistical  model  only  depends  on  its  neighbors  in  the  Markov  Random  Field.  Alternatively, 
the  gradient  of  a  parameter  in  a  high-dimensional  optimization  problem  may  only  depend  on  a  small 
number  of  parameters  encoded  as  neighbors  in  a  graph.  This  explicit  locality  is  essential  to  enable 
parallel  and  distributed  computation  and  efficient  data  placement.  By  exploiting  latent  sparse  structure  in 
ML  problems,  we  can  develop  efficient  scalable  ML  algorithms  which  are  capable  expressing  rich  data 
dependencies. 

While  there  are  MLDM  methods  which  are  both  data-parallel  and  model  dense  data  dependencies  these 
methods  arc  not  typically  scalable  to  the  large  data  regime  and  often  exhibit  poor  generalization  behavior. 
For  instance  learning  the  complete  covariance  matrix  for  a  high  dimensional  data  is  fully  data-parallel 
procedure  which  is  also  unlikely  to  scale  to  extremely  large  problems  or  yield  strong  generalization 
performance.  Alternatively,  using  a  sparse  structured  approximation  (e.g.,  a  tree  graphical  model)  can 
dramatically  reduce  the  computational  complexity  while  improving  the  generalization  performance.  By 
explicitly  exploiting  sparse  dependencies  in  the  design  of  a  parallel  framework  we  obtain  efficient  parallel 
algorithms  that  scale. 


5.2.2  Asynchronous  Iterative  Computation 

Many  important  MLDM  algorithms  iteratively  update  a  large  set  of  parameters.  Because  of  the  underlying 
graph  structure,  parameter  updates  (on  vertices  or  edges)  depend  (through  the  graph  adjacency  structure) 
on  the  values  of  other  parameters.  Furthermore,  the  presence  of  cycles  in  the  graph  forces  most  algorithms 
to  search  for  a  fix-point  which  satisfies  all  the  relationships  encoded  along  the  edges.  For  example,  in  our 
work  on  belief  propagation  (see  Chapter  3)  we  iteratively  updated  edge  parameters  (messages)  until  the 
parameters  stopped  changing  by  more  than  some  small  e.  Alternatively,  in  our  work  on  Gibbs  sampling 
(see  Chapter  4),  we  iterate  until  we  have  obtained  a  sufficient  number  of  samples.  In  both  cases  we  found 
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that  the  order  in  which  parameters  arc  updated  plays  a  central  role  in  both  the  rate  of  convergence  as  well 
as  the  final  fixed-point  achieved. 

In  contrast  to  synchronous  systems,  which  update  all  parameters  simultaneously  (in  parallel)  using 
parameter  values  from  the  previous  time  step  as  input,  asynchronous  systems  update  parameters  as 
resources  become  available  and  using  the  most  recent  parameter  values  as  input.  In  Chapter  3  we  showed 
how  an  asynchronous  schedule  is  essential  to  rapid  stable  convergence  for  the  belief  propagation  algorithms. 
Alternatively,  in  Chapter  4  we  showed  how  a  synchronous  approach  can  lead  to  statistical  divergence  and 
consequently  the  failure  to  reach  the  correct  solution. 

More  generally,  by  using  the  most  recently  computed  parameter  values  and  respecting  the  local  sequential 
order  asynchronous  systems  can  more  efficiently  execute  a  wide  range  of  MLDM  algorithms.  For  example, 
linear  systems  (common  to  many  MLDM  algorithms)  have  been  shown  to  converge  faster  when  solved 
asynchronously  [Bertsekas  and  Tsitsiklis,  1989].  Additionally,  there  are  numerous  other  cases  such  as 
expectation  maximization  [Neal  and  Flinton,  1998],  and  stochastic  optimization  [Macready  et  al.,  1995, 
Smola  and  Narayanamurthy,  2010]  where  asynchronous  procedures  have  been  empirically  shown  to 
significantly  outperform  synchronous  procedures.  Even  classic  algorithms  such  as  PageRank  can  be 
substantially  accelerated  when  run  asynchronous  (see  Figure  5.1a). 

Furthermore  from  a  system  design  perspective,  synchronous  computation  incurs  costly  performance 
penalties  since  the  runtime  of  each  phase  is  determined  by  the  slowest  machine.  The  poor  performance 
of  the  slowest  machine  may  be  caused  by  a  multitude  of  factors  including:  load  and  network  imbalances, 
hardware  variability,  and  multi-tenancy  (a  principle  concern  in  the  Cloud).  Even  in  typical  cluster  settings, 
each  compute  node  may  also  provide  other  services  (e.g.,  distributed  file  systems).  Imbalances  in  the 
utilization  of  these  other  services  will  result  in  substantial  performance  penalties  if  synchronous  computation 
is  used. 


Serializability 


Unlike  synchronous  computation  which  has  a  well  defined  deterministic  execution  semantics,  asynchronous 
computation  can  behave  non-deterministically  especially  in  the  presence  of  data-dependencies.  By  ensuring 
that  all  parallel  executions  have  a  corresponding  sequential  execution,  serializability  eliminates  many  of 
the  challenges  associated  with  designing,  implementing,  and  testing  parallel  and  distributed  asynchronous 
algorithms.  Debugging  mathematical  code  in  a  concurrent  program  which  has  data-corruption  caused  by 
data  races  is  difficult  and  time  consuming.  In  addition,  many  algorithms  converge  faster  if  serializability  is 
ensured,  and  some  even  require  serializability  for  correctness.  For  example  our  work  in  Chapter  3  exploits 
serializability  along  trees  to  achieve  optimality.  Alternatively,  as  discussed  in  Chapter  4,  Gibbs  sampling 
requires  serializability  to  prove  statistical  correctness.  Finally,  some  algorithms  can  be  made  more  stable 
when  run  with  a  serializable  execution.  For  instance,  the  collaborative  filtering  algorithms  such  as  ALS 
(Section  5.8.1)  can  be  unstable  when  allowed  to  race  (Figure  5.  Id). 

However,  serializability  often  comes  at  a  cost  and  recently  Chafi  et  al.  [2011],  Jiayuan  et  al.  [2009] 
advocated  soft-optimization  techniques  (e.g.,  allowing  computation  to  intentionally  race),  but  we  argue 
that  such  techniques  do  not  apply  broadly  in  MLDM.  Even  for  the  algorithms  evaluated  by  Chafi  et  al. 
[2011],  Jiayuan  et  al.  [2009],  the  conditions  under  which  the  soft-optimization  techniques  work  are  not 
well  understood  and  may  fail  in  unexpected  ways  on  different  datasets  and  or  cluster  configurations. 
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Table  5.1:  Comparison  chart  of  large-scale  computation  frameworks,  (a)  Zaharia  et  al.  [2010]  de¬ 
scribes  and  iterative  extension  of  MapReduce,  (b)  Hindman  et  al.  [2009]  proposes  an  iterative  extension  for 
Dryad,  (c)  Piccolo  does  not  provide  a  mechanism  to  ensure  serializability  but  instead  exposes  a  mechanism 
for  the  user  to  attempt  to  recover  from  simultaneous  writes.  References:  MapReduce  [Dean  and  Ghemawat, 
2004],  Dryad  [Isard  et  al.,  2007],  Pregel  [Malewicz  et  al.,  2010],  BPGL  [Gregor  and  Lumsdaine,  2005], 
Piccolo  [Power  and  Li,  2010] 


5.2.3  Dynamic  Computation: 

In  many  MLDM  algorithms,  iterative  computation  converges  asymmetrically.  For  example,  in  parameter 
optimization,  often  a  large  number  of  parameters  will  quickly  converge  in  a  few  iterations,  while  the 
remaining  parameters  will  converge  slowly  over  many  iterations  Efron  et  al.  [2004],  Elidan  et  al.  [2006].  In 
Figure  5.1b  we  plot  the  distribution  of  updates  required  to  reach  convergence  for  PageRank.  Surprisingly, 
the  majority  of  the  vertices  required  only  a  single  update  while  only  about  3%  of  the  vertices  required  more 
than  10  updates.  Additionally,  prioritizing  computation  can  further  accelerate  convergence  as  demonstrated 
by  Zhang  et  al.  [201 1]  for  a  variety  of  graph  algorithms  including  PageRank.  If  we  update  all  parameters 
equally  often,  we  waste  time  recomputing  parameters  that  have  effectively  converged.  Conversely,  by 
focusing  early  computation  on  more  challenging  parameters,  we  can  potentially  accelerate  convergence. 
In  Figure  5.  lc  we  empirically  demonstrate  how  dynamic  scheduling  can  accelerate  convergence  of  loopy 
belief  propagation  (a  popular  MLDM  algorithm).  Likewise,  in  Chapter  4  we  showed  how  prioritization  can 
be  used  to  accelerate  the  construction  of  high-likelihood  samples. 

Unfortunately,  the  role  of  parallel  schedule  for  many  algorithms  is  not  yet  well  studied  largely  because 
existing  tools  do  not  readily  support  the  evaluation  of  new  scheduling  techniques.  A  successful  paral¬ 
lel  abstraction  must  provide  the  ability  to  support  a  wide  variety  of  schedules  and  facilitate  schedule 
comparison. 


5.3  High-Level  Abstractions 


In  this  section  we  review  several  existing  high-level  abstractions  for  designing  and  implementing  parallel 
and  distributed  algorithms  and  discuss  their  limitations  in  the  context  GrAD  principles.  To  better  organize 
their  presentation  we  have  attempted  to  divide  the  abstractions  across  three  general  forms  of  parallelism 
however  in  reality  individual  implementations  may  blend  different  forms  of  parallelism.  Furthermore, 
while  we  broadly  characterize  properties  of  each  abstraction  is  common  that  instantiations  deviate  from  the 
general  abstraction  and  potentially  address  some  of  the  issues  we  discuss. 
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Figure  5.2:  A  typical  MapReduce  operation,  but  drawn  using  “Update”  functions  which  transforms  the 
Map  input  data  in-place;  instead  of  the  canonical  data  flow  model  where  the  Map  operation  writes  its  results 
to  some  output  stream. 

5.3.1  Data-Parallel  Abstractions 

Data  parallel  abstractions  run  the  same  computation  on  independent  sets  of  data  in  parallel.  Data  parallel 
abstractions  are  among  the  most  widely  used  and  include  frameworks  ranging  from  MapReduce  to  the 
general  data-flow  systems  that  form  the  backend  modem  relational  databases.  However,  data-parallel 
abstractions  typically  struggle  to  efficiently  address  fixed  computational  dependencies  and  the  more 
sophisticated  asynchronous  dynamics  that  form  the  foundation  of  the  GrAD  methodology. 


Map-Reduce 

One  of  the  most  widely  used  high-level  distributed  programming  abstractions  is  MapReduce.  A  program 
implemented  in  the  MapReduce  abstraction  (Figure  5.2)  consists  of  a  Map  operation  and  a  Reduce 
operation.  The  Map  operation  is  a  function  which  is  applied  independently  and  in  parallel  to  each 
datum  (e.g.,  webpage)  in  a  large  data  set  (e.g.,  computing  the  word-count).  The  Reduce  operation  is  an 
aggregation  function  which  combines  the  Map  outputs  (e.g.,  computing  the  total  word  count).  MapReduce 
performs  optimally  only  when  the  algorithm  is  embarrassingly  parallel  and  can  be  decomposed  into  a  large 
number  of  independent  computations.  The  MapReduce  framework  expresses  the  class  of  ML  algorithms 
which  fit  the  Statistical-Query  model  [Chu  et  al.,  2006]  and  problems  where  feature  extraction  dominates 
the  run-time. 

The  MapReduce  abstraction  fails  when  there  are  computational  dependencies  in  the  data.  For  example, 
MapReduce  can  be  used  to  extract  features  from  a  massive  collection  of  images  but  cannot  represent 
computation  that  depends  on  small  overlapping  subsets  of  images.  This  critical  limitation  makes  it  difficult 
to  represent  algorithms  that  operate  on  structured  models.  As  a  consequence,  when  confronted  with 
large  scale  problems,  we  often  abandon  rich  structured  models  in  favor  of  overly  simplistic  methods 
that  are  amenable  to  the  MapReduce  abstraction.  Alternatively,  we  could  try  to  transform  algorithms 
with  computational  dependencies  into  data-parallel  valiants  where  computational  dependencies  have  been 
removed.  Transforming  MLDM  algorithms  with  computational  dependencies  into  the  data-parallel  form 
needed  for  the  map-reduce  abstraction  can  introduce  substantial  algorithmic  inefficiency  (e.g..  Synchronous 
BP)  and  in  some  cases  lead  to  divergent  computation  (e.g..  Synchronous  Gibbs). 

While  the  MapReduce  abstraction  can  be  invoked  iteratively,  it  does  not  provide  a  mechanism  to  directly 
encode  iterative  computation.  Furthermore,  to  achieve  fault-tolerance  the  MapReduce  abstraction  must 
redundantly  save  the  output  to  a  distributed  filesystem  after  each  iteration.  As  a  consequence,  applying  the 
MapReduce  framework  to  iterative  problems  with  relatively  small  local  computations,  a  property  common 
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to  many  ML  algorithms,  can  be  extremely  costly.  Recent  projects  such  as  Twister  [Ekanayake  et  ah,  2010] 
and  Spark  [Zaharia  et  ah,  2010]  extend  MapReduce  to  the  iterative  setting  and  address  the  costly  overhead 
of  disk-based  fault-tolerance  by  introducing  novel  recovery  mechanisms.  However,  these  abstractions  are 
constrained  to  fully  synchronous  computation  and  arc  limited  in  their  ability  to  express  rich  asynchronous 
and  dynamic  scheduling. 


Dataflow 

Dataflow  abstractions  such  as  Dryad  Isard  et  al.  [2007]  and  Hyracks  Borkar  et  al.  [201 1]  generalize  the 
Map-Reduce  abstraction  and  represent  parallel  computation  as  a  directed  (typically  acyclic)  graph  with  data 
flowing  along  edges  between  vertices  which  correspond  to  functions  which  receive  information  on  inbound 
edges  and  output  results  to  outbound  edges.  Dataflow  abstractions  are  ideal  for  processing  streaming 
data. 

While  most  dataflow  abstraction  permits  rich  computational  dependencies  through  sophisticated  join 
operations  they  are  not  directly  aware  of  the  graph  dependency  structure  in  advance.  As  a  consequence 
data-flow  abstractions  can  easily  handle  changes  in  the  dependency  structure  but  are  not  optimized  to 
take  advantage  of  the  locality  inherent  in  a  static  dependency  graph.  In  many  cases  such  join  operations 
may  require  substantial  movement  of  data.  Furthermore,  with  the  exception  of  recursive  dataflow  systems 
such  as  Datalog  [Ceri  et  al.,  1989],  most  data-flow  systems  do  not  directly  support  iterative  or  dynamic 
computation. 


5.3.2  Process-Parallel  Abstractions 

In  contrast  to  data-parallel  abstractions,  process-parallel  abstractions  divide  computation  along  separate 
processes  that  interact  through  messages  or  shared  state.  These  include  low  level  primitives  like  pthreads 
and  MPI  as  well  as  slightly  higher-level  abstractions  based  on  actor  models  [Hewitt  et  al.,  1973]  like 
Scala  Akka  or  Erlang  [Virding  et  al.,  1996]  and  distributed  key-value  models  like  Piccolo  [Power  and  Li, 
2010], 


Actor  Models 

The  actor  model  was  first  introduced  by  Hewitt  et  al.  [1973]  and  encompasses  a  very  broad  class  of  process 
parallel  abstractions.  The  actor  abstraction  is  generally  thought  of  as  fully  asynchronous  message  based 
abstraction  in  which  processes  (actors)  communicate  by  sending  and  receiving  messages  and  can  spawn 
other  actors.  There  arc  many  implementations  of  actor  models  ranging  from  programming  languages  such 
as  Erlang  to  actively  developed  public  APIs  such  as  Akka  for  Scala. 

While  the  actor  model  is  in  many  ways  well  suited  to  express  dynamic  asynchronous  computation,  it  does 
not  provide  a  mechanism  to  ensure  serializability  and  forces  the  user  to  directly  manage  the  movement 
of  information  through  messages.  This  makes  it  difficult  to  reason  about  joint  state  of  multiple  actors. 
Furthermore,  these  abstractions  typically  do  not  allow  for  prioritized  computation  and  instead  instantiate 
actors  as  processing  resources  become  available.  Finally,  because  the  actor  model  does  not  impose 
restrictions  on  the  communication  patterns  it  is  unable  to  isolating  load  balancing  and  distributed  data 
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Figure  5.3:  The  Bulk  Synchronous  Parallel  Model  comprises  of  interleaved  fully  parallel  “compute  phases” 
and  “communication  phases”,  where  each  phase  is  followed  by  a  global  synchronization  barrier. 


management  from  the  algorithm  design  forcing  the  algorithm  to  address  both  the  ML  and  systems  problem 
simultaneously. 


Piccolo 


The  Piccolo  abstraction,  introduced  by  Power  and  Li  [2010],  exposes  the  program  state  as  a  large  distributed 
key-value  store  and  synchronously  executes  computational  kernels  in  parallel  which  transform  the  key-value 
store.  In  contrast  to  the  actor  abstraction,  Piccolo  kernels  interact  through  shared  state.  In  addition  the 
Piccolo  abstraction  exposes  a  clever  commutative  associative  transaction  model  that  allows  for  concurrent 
transactions  on  a  single  key-value  pair  to  be  combined.  However,  Piccolo  does  not  explicitly  support  graph 
structures  and  as  a  consequence  locality  is  achieved  through  user  defined  placement  functions. 

Piccolo  implements  the  Bulk  Synchronous  Parallel  (BSP)  model  of  communication  (Figure  5.3),  introduced 
by  Valiant  [1990],  forces  computation  to  be  decomposed  into  small  atomic  components  (like  the  Map 
in  MapReduce),  but  allows  for  phased  communication  between  the  components  and  naturally  expresses 
iterative  algorithms  with  computational  dependencies.  The  BSP  abstraction  divides  an  algorithm  into 
super-steps  each  consisting  of  two  phases.  During  the  first  phase,  each  processor  executes  independently 
(like  the  Map  in  Map-Reduce).  In  the  second  phase,  all  processors  exchange  messages.  Finally,  a  bander 
synchronization  is  performed  before  the  next  iteration,  ensuring  all  processors  compute  and  communicate 
in  lockstep.  While  the  BSP  model  is  easier  to  reason  about  than  the  fully  asynchronous  actors,  it  cannot 
efficiently  express  adaptive  asynchronous  computation. 


5.3.3  Graph-Parallel  Abstractions 


The  need  for  high-level  frameworks  that  specifically  targets  graph  computation  has  lead  to  the  emergence 
of  graph-parallel  abstractions.  A  graph-parallel  abstraction  consists  of  a  sparse  graph  G  =  { V.  E } 
and  a  vertex-program  Q  which  is  executed  in  parallel  on  each  vertex  v  €  V  and  can  interact  (e.g., 
through  shared-state  or  messages)  with  neighboring  instances  Q(u)  where  (it,  v)  6  E.  In  contrast  to  more 
general  message  passing  and  actor  models  models,  graph-parallel  abstractions  constrain  the  interaction  of 
vertex-program  to  a  graph  structure  enabling  the  optimization  of  data-layout  and  communication. 


92 


Pregel 


Malewicz  et  al.  [2010]  introduced  the  Pregel  bulk  synchronous  message  passing  abstraction  in  which 
all  vertex-programs  run  simultaneously  in  a  sequence  of  super-steps.  Within  a  super-step  each  program 
instance  Q(v)  receives  all  messages  from  the  previous  super-step  and  sends  messages  to  its  neighbors  in 
the  next  super-step.  A  barrier  is  imposed  between  super-steps  to  ensure  that  all  program  instances  finish 
processing  messages  from  the  previous  super-step  before  proceeding  to  the  next.  The  program  terminates 
when  there  are  no  messages  remaining  and  every  program  has  voted  to  halt.  Pregel  introduces  commutative 
associative  message  combiners  which  are  user  defined  functions  that  merge  messages  destined  to  the  same 
vertex.  These  message  combiners  behave  similarly  to  the  commutative  associative  transactions  within 
the  Piccolo  abstraction.  In  many  ways  the  Pregel  abstraction  is  a  generalization  of  the  classic  Systolic 
array  [Shapiro,  1988].  While  systolic  arrays  focused  on  a  fixed  often  physical  network  topology  with 
processors  at  each  node,  Pregel  virtualizes  each  processor  as  vertex -program  and  allows  for  arbitrary  graph 
topologies. 

The  following  is  an  example  of  the  PageRank  vertex-program  implemented  in  Pregel.  The  vertex-program 
receives  the  single  incoming  message  (after  the  combiner)  which  contains  the  sum  of  the  PageRanks  of  all 
in-neighbors.  The  new  PageRank  is  then  computed  and  sent  to  its  out-neighbors. 


Message  combiner (Message  ml.  Message  m2)  : 

return  Message (ml . value ( )  +  m2. value ()); 
void  PregelPageRank (Message  msg)  : 
float  total  =  msg. value (); 
vertex. val  =  0.15  +  0.85*total; 
foreach(nbr  in  out_neighbors )  : 

SendMsg(nbr,  vertex . val/num_out_nbrs ) ; 


One  of  the  key  limitations  of  the  Pregel  abstraction  is  that  the  user  must  architect  the  movement  of  program 
state  between  vertex -programs  through  messages.  For  example,  if  we  consider  the  above  PageRank  example 
we  see  that  each  vertex  generates  a  separate  but  identical  message  which  is  sent  to  each  of  its  neighbors.  If 
each  of  its  neighbors  is  physically  mapped  to  a  separate  machine  this  might  be  reasonable.  However,  if 
many  of  its  neighbors  are  on  the  same  machine  this  could  lead  to  substantial  additional  network  overhead. 
We  will  return  to  the  limitations  of  messaging  abstraction  throughout  this  and  the  subsequent  chapter. 

While  Pregel  improves  upon  Piccolo  by  introducing  the  graph  structure  it  retains  the  synchronous  execution 
semantics.  As  a  consequence,  we  can  naturally  express  Synchronous  BP  in  Pregel  but  we  cannot  to  express 
the  more  asynchronous  variants  of  BP.  Moreover,  the  Pregel  abstraction  is  not  designed  for  dynamic 
computation  and  specialized  structured  schedules  which  are  central  to  the  GrAD  methodology.  As  we 
have  already  demonstrated,  by  skipping  unnecessary  vertex-programs  and  by  executing  vertex-programs 
along  spanning  trees,  we  can  reduce  wasted  computation  and  move  information  across  the  graph  more 
efficiently. 


5.4  The  GraphLab  Abstraction 


At  a  high-level,  GraphLab  is  an  asynchronous  distributed  shared-memory  graph-parallel  abstraction  in 
which  vertex  programs  have  shared  access  to  information  on  adjacent  edges  and  vertices  and  can  schedule 
neighboring  vertex-programs  to  be  executed  in  the  future.  By  targeting  GrAD  methodology,  GraphLab 
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achieves  a  balance  between  low-level  and  high-level  abstractions.  Unlike  many  low-level  abstractions 
(e.g.,  MPI,  PThreads),  GraphLab  insulates  users  from  the  complexities  of  synchronization,  data  races 
and  deadlocks  by  providing  a  high  level  data  representation  through  the  data  graph  and  automatically 
maintained  serializability  guarantees  through  configurable  exclusion  policies.  Unlike  many  high-level 
abstractions  (i.e.,  MapReduce),  GraphLab  can  express  complex  computational  dependencies  using  the  data 
graph  and  provides  sophisticated  scheduling  primitives  which  can  express  iterative  parallel  algorithms  with 
dynamic  scheduling. 

The  GraphLab  abstraction  consists  of  three  main  parts,  the  data  graph,  the  vertex-program,  and  the  sync 
operation.  The  data  graph  (Section  5.4.1)  represents  user  modifiable  program  state,  and  stores  both 
the  mutable  user-defined  data  and  encodes  the  sparse  computational  dependencies.  The  vertex-program 
(Section  5.4.2)  represents  the  user  computation  and  operate  on  the  data  graph  by  transforming  data  in  small 
overlapping  contexts  called  scopes.  Finally,  the  sync  operation  (Section  5.4.6)  concurrently  maintains 
global  aggregates.  To  ground  the  GraphLab  abstraction  in  a  concrete  problem,  we  will  use  the  PageRank 
algorithm  Page  et  al.  [1999]  as  a  running  example. 

Example  5.4.1  (PageRank).  The  PageRank  algorithm  recursively  defines  the  rank  of  a  webpage  v: 

R(v)  =  —  +  (1  —  7)  y2  Wu,vxR{u)  (5.1) 

n 

u  links  to  v 

in  terms  of  the  weighted  wUjV  ranks  R(u )  of  the  pages  u  that  link  to  v  as  well  as  some  probability  7  of 
randomly  jumping  to  a  new  page  (typically  7  =  0.15).  The  PageRank  algorithm  iterates  Eq.  (5.1 )  until  the 
PageRank  changes  by  less  than  some  small  value  e. 


5.4.1  Data  Graph 

The  GraphLab  abstraction  stores  the  program  state  as  a  directed  graph  called  the  data  graph.  The  data 
graph  G  =  (V,  E.  D)  is  a  container  that  manages  the  user  defined  data  D.  Here  we  use  the  term  data 
broadly  to  refer  to  model  parameters,  algorithm  state,  and  even  statistical  data.  Users  can  associate 
arbitrary  data  with  each  vertex  {Dv  :  v  G  V)  and  edge  { T)„->v  '■  {'«,  u}  £  E}  in  the  graph.  However,  as 
the  GraphLab  abstraction  does  not  depend  on  edge  directions,  we  also  use  Du^v  to  denote  the  data  on 
both  edge  directions  u  — >  v  and  v  a.  Since  some  MLDM  applications  require  directed  edge  data  (e.g., 
weights  on  directed  links  in  a  web-graph),  we  provide  the  ability  to  store  and  retrieve  data  associated  with 
each  direction  along  an  edge.  Finally,  while  the  graph  data  is  mutable,  the  structure  is  static  and  cannot  be 
changed  during  execution. 

Example  5.4.2  (PageRank:  Example  5.4.1).  The  data  graph  is  directly  obtained  from  the  web  graph,  where 
each  vertex  corresponds  to  a  web  page  and  each  edge  represents  a  link.  The  vertex  data  Dv  stores  R(v), 
the  current  estimate  of  the  PageRank,  and  the  edge  data  Du^v  stores  wUjV,  the  directed  weight  of  the  link. 

The  data  graph  naturally  represents  the  state  of  a  wide  range  of  machine  learning  algorithms.  For  example 
in  belief  propagation,  the  data  graph  is  the  pairwise  MRF,  with  the  vertex  data  Dv  to  storing  the  node 
potentials  and  the  directed  edge  data  Du^v  storing  the  BP  messages.  If  the  MRF  is  sparse  then  the  data 
graph  is  also  sparse  and  GraphLab  will  achieve  a  high  degree  of  parallelism.  The  data  graph  can  also  be 
used  to  represent  many  algorithms  that  do  not  explicitly  operate  on  graph  structures.  For  example,  the 
shooting  algorithm  (coordinate-wise  sub-gradient  descent)  for  the  Lasso  can  be  transformed  into  a  data 
graph  by  connecting  each  weight  with  examples  that  have  non-zero  feature  values  for  that  weight. 
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Scope  S1 


Figure  5.4:  The  data  graph  and  scope  <Si.  Gray  cylinders  represent  the  user  defined  vertex  and  edge 
data  while  the  irregular  region  containing  the  vertices  {1,  2,  3, 4}  is  the  scope,  of  vertex  1.  An  vertex 
program  applied  to  vertex  1  is  able  to  read  and  modify  all  the  data  in  5i  (vertex  data  D i,  D2,  D3,  and  D4 
and  edge  data  D io2,  £>1^.3,  and  D  1^.4). 


As  we  will  show  in  subsequent  sections,  the  explicit  data  graph  representation  of  the  program  state  also 
enables  the  GraphLab  runtime  to  effectively  place  the  entire  program  state  in  a  distributed  environment, 
automatically  enforce  powerful  serializability  models,  and  even  achieve  efficient  fault-tolerance. 


5.4.2  GraphLab  Vertex  Programs 

Computation  is  encoded  in  the  GraphLab  abstraction  in  the  form  of  a  user  defined  vertex  program.  A 
GraphLab  vertex  program  is  a  stateless  procedure  that  modifies  the  data  within  the  scope  of  a  vertex 
and  schedules  the  future  execution  of  vertex  programs  on  neighboring  vertices.  The  scope  of  vertex  v 
(denoted  by  Sv)  is  the  data  stored  on  v,  as  well  as  the  data  stored  on  all  adjacent  vertices  and  adjacent  edges 
(Figure  5.4).  A  GraphLab  vertex  program  is  analogous  to  the  map  function  in  MapReduce,  but  unlike  in 
MapReduce,  vertex  programs  are  permitted  to  access  and  modify  overlapping  contexts  in  the  graph. 

A  GraphLab  vertex  program  takes  as  an  input  a  vertex  v  and  its  scope  Sv  and  returns  the  new  versions  of 
the  data  in  the  scope  as  well  as  a  set  vertices  T : 

Vprog  :  f{v,Sv )  ->  (SV,T) 

After  executing  a  vertex  program  the  modified  data  in  Sv  is  written  back  to  the  data  graph.  We  refer  to  the 
neighborhood  Sv  as  the  scope  of  v  because  Sv  defines  the  extent  of  the  graph  that  can  be  accessed  by  the 
vertex  program  /  when  applied  to  v.  The  set  of  vertices  T  arc  eventually  executed  by  applying  the  vertex 
program  f(u,  Su)  for  all  u  €  T  following  the  execution  semantics  described  later  in  Section  5.4.3. 

Rather  than  adopting  a  message  passing  (e.g..  Pregel)  or  data  flow  (e.g..  Dryad)  model,  GraphLab  allows 
the  user  defined  vertex  programs  complete  freedom  to  read  and  modify  any  of  the  data  on  adjacent  vertices 
and  edges.  This  simplifies  user  code  and  eliminates  the  need  for  the  users  to  orchestrate  the  movement 
of  data.  This  allows  the  implementation  of  the  system  to  choose  when  and  how  to  move  vertex  and  edge 
data  in  a  distributed  environment.  By  controlling  what  vertices  are  returned  in  T  and  thus  are  eventually 
executed,  GraphLab  vertex  programs  can  efficiently  express  adaptive  computation.  For  example,  a  vertex 
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program  may  choose  to  return  (schedule)  its  neighbors  only  when  it  has  made  a  substantial  change  to  its 
local  data. 

There  is  an  important  difference  between  Pregel  and  GraphLab  in  how  dynamic  computation  is  expressed. 
GraphLab  decouples  the  scheduling  of  future  computation  from  the  movement  of  data.  As  a  consequence, 
GraphLab  vertex  programs  have  access  to  data  on  adjacent  vertices  even  if  the  adjacent  vertices  did  not 
schedule  the  current  execution.  Conversely,  Pregel  vertex  programs  are  initiated  by  messages  and  can  only 
access  the  data  in  the  messages  sent  during  the  previous  super-step.  This  restriction  limits  what  can  be 
expressed  and  explicitly  couples  program  dynamics  to  the  user  orchestrated  movement  of  data. 

For  instance,  dynamic  PageRank  is  difficult  to  express  in  Pregel  since  the  PageRank  computation  for  a 
given  page  requires  the  PageRank  values  of  all  adjacent  pages  even  if  some  of  the  adjacent  pages  have  not 
recently  changed.  Therefore,  the  decision  to  send  data  (PageRank  values)  to  neighboring  vertices  cannot 
be  made  by  the  sending  vertex  (as  required  by  Pregel)  but  instead  must  be  made  by  the  receiving  vertex. 
GraphLab,  naturally  expresses  the  pull  behavior,  since  adjacent  vertices  are  only  responsible  for  scheduling, 
and  vertex  programs  can  always  directly  read  adjacent  vertex  values  even  if  they  have  not  changed. 

Example  5.4.3  (PageRank:  Example  5.4.1).  The  vertex  program  for  PageRank  ( defined  in  Alg.  5.1) 
computes  a  weighted  sum  of  the  current  ranks  of  neighboring  vertices  and  assigns  it  as  the  rank  of  the 
current  vertex.  The  algorithm  is  adaptive:  neighbors  are  scheduled  for  execution  only  if  the  value  of  current 
vertex  changes  by  more  than  a  predefined  threshold. 


Algorithm  5.1:  PageRank  Vertex  Program  in  the  GraphLab  Abstraction 
Input:  Vertex  data  R(i>)  from  Sv 
Input:  Edge  data  {wUtV  :  u  G  A/),}  from  Sv 
Input:  Neighbor  vertex  data  {R(tt)  :  u  €  A/”t,}  from  Sv 
Roid(r’)  t—  R(u)  //  Save  old  PageRank 

R(t>)  t—  7 /n 

foreach  u  G  Afv  do  //  Loop  over  neighbors 

|_  R(f) «-  R(y)  +  (1  -  7)  *  wu,v  *  R(«) 

//  If  the  PageRank  changes  sufficiently 
if  | R(v)  -  Ra/d(v) |  >  e  then 

//  Schedule  neighbors  to  be  executed 
return  {tt:«G  A fv} 

Output:  Modified  scope  Sv  with  new  R(t>) 


Algorithms  that  follow  the  GrAD  principles  can  naturally  be  expressed  as  GraphLab  vertex  programs.  For 
example,  probabilistic  inference  algorithms  discussed  in  earlier  chapters  can  directly  translated  into  vertex 
programs.  Like  wise  other  classical  algorithms  such  as  expectation  propagation  Minka  [2001]  and  mean 
field  variational  inference  Xing  et  al.  [2003]  can  all  be  expressed  using  vertex  programs  which  read  the 
current  assignments  to  the  parameter  estimates  on  neighboring  vertices  and  edges  and  then  apply  sampling 
or  optimization  techniques  to  update  parameters  on  the  local  vertex. 


5.4.3  The  GraphLab  Execution  Model 

The  GraphLab  execution  model,  presented  in  Alg.  5.2,  follows  a  simple  single  loop  semantics.  The  input  to 
the  GraphLab  abstraction  consists  of  the  data  graph  G  =  ( [V ,  E,  D),  a  vertex  program,  and  an  initial  active 
set  of  vertices  T  to  be  executed.  While  there  arc  vertices  remaining  in  T,  the  algorithm  removes  (Line  1) 
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Algorithm  5.2:  GraphLab  Execution  Model 

Input:  Data  Graph  G  =  {V,  E,  D) 

Input:  Initial  vertex  set  T  =  { i'i ,  v-> , ...} 
while  T  is  not  Empty  do 

1  v  <r-  RemoveNext  (7~ ) 

2  (T',SV)  4-  f(v,Sv) 

3  [  r^rur 

Output:  Modified  Data  Graph  G  =  (V,  E,  I)') 


and  executes  (Line  2)  vertices,  adding  any  new  vertices  back  into  T  (Line  3).  The  active  set  T  behaves  like 
a  classic  set  and  therefore  contains  at  most  one  instance  of  each  vertex.  As  a  consequence  if  a  vertex  is 
added  more  than  once  before  being  executed,  then  it  is  executed  only  once.  The  resulting  data  graph  is 
returned  to  the  user  on  completion. 


5.4.4  Scheduling  Vertex  Programs 

The  order  in  which  vertices  arc  returned  from  RemoveNext  (T)  (Line  1)  is  determined  by  the  GraphLab 
scheduler.  GraphLab  provides  a  collection  of  schedulers  each  with  different  behaviors.  While  all  schedulers 
must  ensure  that  all  vertices  in  the  active  set  T  are  eventually  executed,  each  scheduler  is  free  to  choose  an 
order  that  optimizes  system  performance,  parallelization,  and  algorithm  convergence. 

The  GraphLab  framework  provides  three  fundamental  schedulers  which  are  capable  of  expressing  a  wide 
range  of  computation.  The  simplest  of  these  schedulers  is  the  sweep  scheduler  which  always  executes 
vertices  in  T  in  a  canonical  order.  Typically  the  canonical  order  is  either  a  random  permutation  of  the 
vertices  or  a  specifically  chosen  order  based  on  properties  of  the  vertices  or  the  structure  of  the  graph. 
The  sweep  scheduler  is  ideally  suited  for  algorithms  like  Gibbs  sampling  which  repeatedly  iterate  over  all 
vertices. 

As  we  observed  in  our  earlier  work  in  graphical  model  inference,  it  is  often  advantageous  to  schedule 
computation  along  a  wavefront.  This  can  be  accomplished  by  using  the  FiFo  scheduler  which  executes 
vertices  in  the  order  in  which  they  are  added.  The  FiFo  scheduler  preserves  the  set  semantics  of  T  and 
so  repeated  additions  of  the  same  vertex  are  ignored  until  the  vertex  leaves  the  FiFo  schedule.  As  a 
consequence  the  FiFo  schedule  effectively  maintains  a  frontier  set  and  enables  a  bread-first  traversal  of  the 
graph. 

While  the  presence  of  the  active  set  T  of  vertices  automatically  leads  to  adaptive  computation,  we 
demonstrated  in  our  earlier  work  (Chapter  3  and  Chapter  4)  that  dynamic  prioritization  can  further 
accelerate  the  convergence.  Therefore  GraphLab  provides  a  priority  scheduler  which  associates  a  priority 
with  each  vertex  and  returns  the  active  vertices  in  order  of  priority.  If  a  vertex  is  added  more  than  once  than 
its  highest  priority  value  is  used  and  so  vertex  programs  can  promote  the  priority  of  active  vertices.  The 
priority  scheduler  is  used  to  implement  algorithms  like  residual  belief  propagation. 

In  addition  to  the  three  basic  schedules,  we  also  have  implemented  several  special  purpose  schedulers.  For 
example,  motivated  by  our  work  on  the  Splash  BP  and  Splash  Gibbs  sampling  algorithms  we  developed 
a  specialized  Splash  scheduler  which  automatically  constructs  bounded  work  Splashes.  We  believe  that 
work  in  improved  scheduling  heuristics  can  lead  to  more  efficient  parallel  algorithms.  As  a  consequence 
GraphLab  makes  it  easy  for  users  to  compose  different  schedulers  with  the  same  vertex  program. 
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(b)  Serializability  and  Parallelism 


Figure  5.5:  (a)  The  read  and  write  permissions  for  a  vertex  program  executed  on  vertex  3  under  each 
of  the  exclusion  policies.  Under  the  scope  exclusion  policy  the  vertex  program  has  complete  read-write 
access  to  its  entire  scope.  Under  the  edge  exclusion  policy,  the  vertex  program  has  only  read  access  to 
adjacent  vertices.  Finally,  vertex  exclusion  policy  only  provides  write  access  to  the  central  vertex  data,  (b) 
The  trade-off  between  serializability  and  parallelism.  The  dark  rectangles  denote  the  write-locked  regions 
that  cannot  overlap.  Vertex  program  are  executed  on  the  dark  vertices  concurrently.  Under  the  stronger 
serializability  models  fewer  functions  can  run  concurrently. 


When  executed  in  a  distributed  environment  each  scheduler  can  optionally  relax  its  scheduling  guarantees 
to  improve  system  performance.  For  example,  in  the  distributed  setting  each  scheduler  only  maintain  its 
scheduling  invariant  with  respect  to  vertices  explicitly  assigned  to  that  machine.  Flowever,  regardless  of 
the  parallel  or  distributed  execution  each  scheduler  must  always  maintain  the  invariant  that  all  schedule 
vertices  arc  executed  eventually. 


5.4.5  Ensuring  Serializable  Execution 

The  GraphLab  abstraction  presents  an  inherently  sequential  model  of  computation  which  is  automatically 
translated  into  a  parallel  execution  by  allowing  multiple  processors  to  execute  the  same  loop  on  the  same 
graph,  removing  and  executing  different  vertices  concurrently.  To  retain  the  sequential  execution  semantics 
we  prevent  overlapping  computation  from  running  concurrently.  Flowever,  the  extent  to  which  computation 
can  safely  overlap  depends  on  the  user  defined  computation  within  the  vertex  program  as  well  as  the 
stability  of  the  underlying  algorithm.  We  therefore  introduce  several  exclusion  policies  that  allow  the 
runtime  to  optimize  the  parallel  execution  while  maintaining  serializability. 

The  GraphLab  runtime  ensures  a  serializable  execution.  A  serializable  execution  implies  that  there  exists 
a  corresponding  serial  schedule  of  vertex  programs  that  when  executed  sequentially  following  Alg.  5.2 
produces  the  same  values  in  the  data-graph. 

Definition  5.4.1  (GraphLab  Serializability  Guarantee).  For  every  parallel  execution  of  the  GraphLab  ab¬ 
straction,  there  exists  a  sequential  ordering  on  the  vertices  such  that  vertex  programs  executed  sequentially 
according  to  that  ordering  produce  the  same  data  graph. 

By  ensuring  serializability,  GraphLab  simplifies  reasoning  about  highly-asynchronous  dynamic  computation 
in  the  distributed  setting.  For  example,  in  a  serializable  execution  it  is  not  possible  for  adjacent  vertices 
to  run  concurrently.  As  a  consequence,  if  an  invariant  over  the  neighborhood  of  a  vertex  is  true  at  the 


98 


start  of  the  vertex  program  it  will  remain  true  until  after  the  vertex  program  completes.  Alternatively,  a 
non- serializable  execution  would  make  it  extremely  difficult  to  maintain  neighborhood  invariants. 

Serializability  is  achieved  through  three  different  mutual  exclusion  policies  which  prevent  vertex  programs 
with  overlapping  state  from  executing  concurrently  (see  Figure  5.5a).  The  strongest  policy  is  the  scope 
exclusion  policy  which  prevents  vertex  programs  with  overlapping  scopes  from  executing  concurrently.  As 
a  consequence  the  scope  exclusion  policy  ensures  that  during  the  execution  of  the  vertex  program  on  the 
vertex  v  no  other  function  will  read  or  modify  data  within  Sv.  However,  the  scope  exclusion  policy  limits 
the  potential  parallelism  since  concurrently  executing  vertices  cannot  share  any  common  neighbors  (i.e., 
must  be  at  least  two  vertices  apart)  as  illustrated  in  Figure  5.5b. 

The  vast  majority  of  vertex  programs  only  read  the  values  of  adjacent  vertices.  In  this  case  we  can  permit 
vertices  with  common  neighbors  to  run  concurrently  and  still  ensure  serializability.  For  instance,  the 
PageRank  vertex  program  only  requires  read  access  to  neighboring  vertices.  To  provide  greater  parallelism 
while  retaining  serializability,  GraphLab  defines  the  edge  exclusion  policy.  The  edge  exclusion  policy 
ensures  each  vertex  program  has  exclusive  read-write  access  to  its  vertex  and  adjacent  edges  but  read  only 
access  to  adjacent  vertices  as  illustrated  in  Figure  5.5a).  As  a  consequence,  the  edge  exclusion  policy 
increases  parallelism  by  allowing  vertex  programs  with  slightly  overlapping  scopes  to  safely  run  in  parallel 
(see  Figure  5.5b). 

The  vertex  exclusion  policy  is  the  weakest  of  the  three  policies  and  guarantees  a  serializable  execution  only 
if  vertex  programs  do  not  read  or  write  to  neighboring  vertices  and  do  not  write  to  adjacent  edges.  Under 
the  vertex  exclusion  policy  neighboring  vertex  programs  can  run  concurrently.  By  allowing  neighboring 
vertex  programs  to  run,  the  vertex  exclusion  policy  allows  maximum  parallelism.  While  very  few  vertex 
programs  satisfy  this  weak  policy,  it  can  be  useful  when  initializing  data  and  more  importantly  to  explore 
the  non-serializable  executions  of  vertex  programs. 

We  summarize  the  guarantees  of  each  of  the  three  mutual  exclusion  policies  in  the  following: 

Proposition  5.4.4.  GraphLab  guarantees  serializability  under  the  following  three  conditions: 

1.  The  scope  exclusion  policy  is  used 

2.  The  edge  exclusion  policy  is  used  and  vertex  programs  do  not  modify  data  on  adjacent  vertices. 

3.  The  vertex  exclusion  policy  is  used  and  vertex  programs  do  not  read  or  write  to  adjacent  vertices 
and  do  not  write  to  adjacent  edges. 

Because  mutual  exclusion  and  data  serializability  are  maintained  by  the  GraphLab  abstraction,  the  user 
does  not  have  to  deal  with  the  challenges  of  avoiding  deadlock  or  data  races.  This  is  an  important  property 
since,  preventing  deadlock  and  race  conditions  is  one  of  the  principle  challenges  of  parallel  and  distributed 
algorithm  design.  Moreover,  if  the  user  selects  the  correct  mutual  exclusion  policy  and  ensures  that  every 
vertex  program  returns,  the  GraphLab  abstraction  guarantees  a  race  free  and  deadlock  free  execution. 

The  serializability  property  of  the  GraphLab  abstraction  also  enables  the  expression  and  analysis  of  a  much 
richer  class  of  vertex  programs.  For  example,  we  can  easily  implement  graph  coloring  by  having  each 
vertex  choose  a  color  that  is  different  from  its  neighbors.  We  can  prove  that  under  the  edge  exclusion  policy, 
the  graph  coloring  vertex  program  will  converge  and  each  vertex  will  be  executed  only  once.  In  contrast, 
under  the  bulk  synchronous  parallel  model  adopted  by  abstractions  like  Pregel,  the  same  algorithm  will 
never  converge.  Furthermore,  we  can  directly  apply  many  of  the  theoretical  tools  for  analyzing  serial  graph 
algorithms  to  the  analysis  of  parallel  graph  algorithms  running  in  the  GraphLab  abstraction.  For  example, 
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by  running  a  Gibbs  sampler  vertex  program  under  the  edge  exclusion  policy,  we  can  directly  apply  existing 
theoretical  proofs  of  ergodicity  to  the  parallel  execution. 

While  a  serializable  execution  is  essential  when  designing,  implementing,  and  debugging  complex  MLDM 
algorithms,  some  algorithms  arc  robust  to  non-serializable  executions  and  may  benefit  from  the  increased 
parallelism  achieved  by  running  under  a  weaker  mutual  exclusion  policy.  For  example,  work  by  Chafi 
et  al.  [2011],  Recht  et  al.  [2011]  has  demonstrated  that  some  algorithms  can  run  efficiently  even  in  a  fully 
asynchronous  setting.  Therefore  we  allow  users  to  choose  a  weaker  mutual  exclusion  policy. 


5.4.6  Aggregation  Framework 

In  many  MLDM  algorithms  it  is  necessary  to  maintain  global  statistics  describing  data  stored  in  the  data 
graph.  For  example,  many  statistical  inference  algorithms  require  tracking  global  convergence  estimators. 
To  address  this  need,  the  GraphLab  abstraction  defines  global  values  that  may  be  read  by  vertex  programs, 
but  are  written  using  aggregators.  Similar  to  aggregates  in  Pregel,  A  GraphLab  global  aggregate  is  an 
commutative  associative  sum: 

Z  =  Finalize  f  ^^Map(5„)  J  (5.2) 

\v£V  / 

defined  over  all  the  scopes  in  the  graph.  Unlike  Pregel,  GraphLab  aggregates  introduce  a  finalization  phase, 
Finalize(-),  to  support  tasks,  like  normalization,  which  are  common  in  MLDM  algorithms.  Also  in  contrast 
to  Pregel,  where  the  aggregation  runs  after  each  super-step,  aggregation  in  GraphLab  runs  continuously  in 
the  background  to  maintain  updated  estimates  of  the  global  value. 

Since  every  vertex  program  can  access  global  values,  ensuring  serializability  of  global  aggregates  with 
respect  to  vertex  programs  is  costly  and  requires  synchronizing  and  blocking  all  vertex  programs  while 
the  global  value  is  computed.  Therefore  just  as  GraphLab  has  multiple  exclusion  policies  for  vertex 
programs,  GraphLab  similarly  provides  the  option  to  enable  or  disable  serializability  of  global  aggregate 
computations. 

Example  5.4.5  (PageRank:  Example  5.4.1).  We  can  compute  the  second  most  popular  page  on  the  web  by 
defining  the  following  aggregate  operation: 

Map  :  (Sv)  — >  (R(v),  —  oo)  (5.3) 

©  :  (R(ai),R(a2)) ,  (R(b i),R(b2))  -+  TopTwo  ({R{ai) , R(a2) , R^) , R(b2)})  (5.4) 

Finalize  :  (R(a\) , R(a2))  — >  R(a2)  (5.5) 

We  may  want  to  update  the  global  estimate  every  r  =  |  Vj  vertex  updates. 


5.5  Multicore  Design 


We  first  implemented  an  optimized  version  of  the  GraphLab  framework  for  the  shared-memory  multicore 
setting  in  the  form  of  a  C++  API.  The  user  first  constructs  and  populates  the  data  graph  with  initial  data. 
The  user  also  constructs  a  GraphLab  engine  specifying  the  vertex  program,  mutual  exclusion  policy,  and 
scheduler  as  well  as  any  aggregation  operations.  Computation  is  initiated  by  passing  the  data  graph  and  an 
initial  set  of  vertices  to  the  engine.  The  engine  worker  threads  then  run  concurrently  until  no  active  vertices 
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remain  and  the  data  graph  is  modified  in  place.  When  the  engine  returns  the  data  graph  has  been  updated 
and  there  are  no  remaining  active  vertices. 

We  implemented  the  C++  API  by  building  upon  our  work  in  parallel  inference  in  graphical  models.  The 
data  graph  was  implemented  as  a  simple  generic  container  with  user  defined  objects  associated  with  each 
vertex  and  edge  based  on  the  graph  data  structures  used  to  represent  graphical  models. 

Likewise,  the  mutual  exclusion  policies  were  implemented  by  extending  the  ordered  locking  used  in  the 
Markov  blanket  locking  protocol  discussed  in  Section  4.4.  We  associate  a  readers-writer  lock  with  each 
vertex.  The  different  exclusion  policys  are  implemented  using  different  locking  protocols.  Vertex  exclusion 
is  achieved  by  acquiring  a  write-lock  on  the  central  vertex  of  each  requested  scope.  Edge  exclusion  is 
achieved  by  acquiring  a  write  lock  on  the  central  vertex,  and  read  locks  on  adjacent  vertices.  Finally,  scope 
exclusion  is  achieved  by  acquiring  write  locks  on  the  central  vertex  and  all  adjacent  vertices.  As  with 
the  Markov  blanket  locking  protocol,  deadlocks  are  avoided  by  acquiring  locks  sequentially  following  a 
canonical  order. 

The  primary  challenge  was  designing  efficient  parallel  schedulers  and  this  was  achieved  using  the  same 
techniques  discussed  in  Chapter  3.  To  attain  maximum  performance  we  had  to  address  issues  related  to 
parallel  memory  allocation,  concurrent  random  number  generation,  and  cache  efficiency.  Since  mutex 
collisions  can  be  costly,  lock-free  data  structures  and  atomic  operations  were  used  whenever  possible. 
To  achieve  the  same  level  of  performance  for  parallel  learning  system,  the  MLDM  community  would 
have  to  repeatedly  overcome  many  of  the  same  time  consuming  system  design  challenges  needed  to  build 
GraphLab. 


5.6  Multicore  GraphLab  Evaluation 


To  demonstrate  the  expressiveness  of  the  GraphLab  abstraction  and  illustrate  the  parallel  performance 
gains  it  provides,  we  implemented  four  popular  ML  algorithms  and  evaluate  these  algorithms  on  large 
real-world  problems  using  a  16-core  computer  with  4  AMD  Opteron  8384  processors  and  64GB  of  RAM. 
All  single  core  (serial)  timing  was  run  with  locking  disabled  to  eliminate  as  much  system  overhead  as 
possible.  We  fount  that  often  the  system  optimizations  made  to  GraphLab  resulted  in  better  performance 
than  our  previous  application  specific  implementations  of  many  of  the  algorithms. 


5.6.1  Parameter  Learning  and  Inference  in  a  Markov  Random  Field 

To  demonstrate  how  the  various  components  of  the  GraphLab  framework  can  be  assembled  to  build  a 
complete  “pipeline”  for  probabilistic  reasoning  we  use  GraphLab  to  solve  a  novel  three-dimensional  retinal 
image  restoration  task.  In  this  task  we  begin  with  raw  three-dimensional  laser  density  estimates  and 
then  use  GraphLab  to  generate  composite  statistics,  leam  parameters  for  a  large  three-dimensional  grid 
pairwise  MRF,  and  then  finally  compute  expectations  for  each  voxel  using  Loopy  BR  Each  of  these  tasks 
requires  both  local  iterative  computation  and  global  aggregation  as  well  as  several  different  computation 
schedules. 

We  begin  by  using  the  GraphLab  data  graph  to  build  a  large  (256x64x64)  three  dimensional  MRF  in  which 
each  vertex  corresponds  to  a  voxel  in  the  original  image.  We  connect  neighboring  voxels  in  the  6  axis 
aligned  directions.  We  store  the  density  observations  and  beliefs  in  the  vertex  data  and  the  BP  messages  in 
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Algorithm  5.3:  BP  Vertex  Program 
Compute  the  local  belief  b(xv)  using  {D*^.VDV} 
active  neighbors  T 

foreach  (v  — >  t)  G  (v  — >  *)  do 

Update  mv->t(xt)  using  {£>*_».„,  Dv} 
residual  <—  || mv^t{xt)  ~  /n"ld./(;r,)j|l 
if  residual  >  Termination  Bound  then 

j  Add  neighbor  to  active  neighbors  T  <—  T  U  (t,  residual)  with  residual  priority 
return  active  neighbors  T 


Algorithm  5.4:  Parameter  Learning  Aggregator 
Map:  return  deviation  from  empirical  marginal  for  each  pixel 
©  :  Combine  image  statistics  by  averaging 
Finalize:  Apply  gradient  step  to  A  using  the  final  accumulator 


the  directed  edge  data.  We  will  use  the  aggregation  framework  to  maintain  three  global  parameters  which 
determine  the  smoothing  in  each  dimension  along  the  MRF.  Prior  to  learning  the  model  parameters,  we  first 
use  the  GraphLab  aggregation  framework  to  compute  axis-aligned  averages  as  a  proxy  for  ground-truth 
smoothed  images  along  each  dimension.  We  then  performed  concurrent  learning  and  inference  in  GraphLab 
by  using  the  aggregation  framework  (Alg.  5.4)  to  aggregate  inferred  model  statistics  and  apply  a  gradient 
descent  procedure.  To  the  best  of  our  knowledge,  this  is  the  first  time  graphical  model  parameter  learning 
and  BP  inference  have  been  done  concurrently. 


Results:  In  Figure  5.6a  we  plot  the  speedup  of  the  parameter  learning  algorithm,  executing  inference  and 
learning  sequentially.  We  found  that  the  Splash  BP  scheduler  outperforms  other  scheduling  techniques 
enabling  a  factor  15  speedup  on  16  cores.  We  then  evaluated  concurrent  parameter  learning  and  inference 
by  allowing  the  aggregation  framework  to  run  concurrently  with  inference  (Figure  5.6d  and  Figure  5.6e). 
In  Figure  5.6b  and  Figure  5.6c  we  plot  examples  of  noisy  and  denoised  cross  sections  respectively. 

By  running  a  background  aggregation  operation  at  the  right  frequency,  we  found  that  we  can  further 
accelerate  parameter  learning  while  only  marginally  affecting  the  learned  parameters.  In  fact,  by  frequently 
taking  gradient  steps  (once  every  3  seconds),  before  belief  propagation  has  converged  to  an  accurate 
gradient  estimator,  we  are  able  to  obtain  a  factor  of  three  reduction  in  the  overall  running  time.  The 
substantial  reduction  has  leads  to  slightly  different  parameters  (differs  by  4%)  but  does  not  affect  the  quality 
of  the  final  prediction.  We  believe  this  provides  a  strong  indication  of  the  potential  of  the  asynchronous 
approach  in  the  GrAD  methodology. 


5.6.2  Gibbs  Sampling 

To  demonstrate  the  power  of  the  GraphLab  abstraction  we  implement  a  series  of  provably  ergodic  (asymp¬ 
totically  consistent)  Gibbs  samplers.  In  Chapter  4  we  demonstrated  how  the  Gibbs  sampling  algorithm 
is  inherently  sequential  and  has  frustrated  efforts  to  build  asymptotically  consistent  parallel  samplers. 
Flowever,  by  exploiting  serializability  we  are  able  to  easily  construct  an  asymptotically  consistent  parallel 
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(a)  Speedup 


(b)  Original  (c)  Denoised 


(d)  Bkgnd  Agg.  Runtime  (e)  Bkgnd  Agg.  Error 

Figure  5.6:  Retinal  Scan  restoration  (a)  Speedup  relative  to  the  best  single  processor  runtime  of  parameter 
learning  using  priority,  and  Splash  schedules.  (b,c)  A  slice  of  the  original  noisy  image  and  the  corresponding 
expected  pixel  values  after  parameter  learning  and  restoration,  (d  )  The  total  runtime  in  seconds  of  parameter 
learning  and  (e)  the  average  percent  deviation  in  learned  parameters  plotted  against  the  time  between 
gradient  steps  using  the  Splash  schedule  on  16  processors. 


Gibbs  sampler.  In  this  example  we  compare  the  FiFo  and  Sweep  schedulers  with  a  custom  built  chromatic 
schedule  from  Section  4.3. 

To  apply  the  chromatic  schedule  we  needed  to  first  construct  a  coloring  of  the  data  graph.  This  is 
accomplished  by  implementing  the  standard  greedy  graph  coloring  algorithm  as  a  GraphLab  vertex 
program  which  examines  the  colors  of  the  neighboring  vertices  of  v,  and  sets  v  to  the  first  unused  color. 
We  use  the  edge  exclusion  policy  to  ensure  that  the  parallel  execution  retains  the  same  guarantees  as  the 
sequential  version.  We  were  able  to  color  the  graph  in  fractions  of  a  second.  Once  we  have  a  valid  graph 
coloring  we  can  then  revert  to  the  vertex  exclusion  policy  when  using  the  Chromatic  schedule. 

To  evaluate  the  GraphLab  implementation  of  the  parallel  Gibbs  sampler  we  consider  the  challenging  task 
of  marginal  estimation  on  a  factor  graph  representing  a  protein-protein  interaction  network  obtained  from 
Elidan  et  al.  [2006]  by  generating  10,000  samples.  The  resulting  graph  (MRF)  has  roughly  100K  edges 
and  14K  vertices.  Unlike  the  earlier  pairwise  MRFs  these  networks  arc  generally  more  challenging  with 
higher  order  factors  and  more  irregular  structure  and  substantially  greater  connectivity.  In  Figure  5.7  we 
plot  the  color  distribution  for  the  data-graph  and  we  see  that  their  arc  relatively  few  colors.  As  a  baseline 
for  comparison  we  also  ran  the  GraphLab  version  Splash  Loopy  BP  (Chapter  3)  algorithm. 
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Figure  5.7:  Color  distribution  for  the  protein-protein  interaction  network. 


Results:  In  Figure  5.8  we  present  the  runtime,  speedup,  and  efficiency  results  for  Gibbs  sampling  and 
Loopy  BP.  Consistent  with  our  earlier  findings  in  Chapter  4  the  chromatic  sampler  is  able  to  scale  slightly 
better  because  we  can  eliminate  the  need  for  the  heavy  locking  in  the  edge  exclusion  policy.  In  contrast  the 
Loopy  BP  speedup  demonstrates  considerably  better  scaling  with  factor  of  15  speedup  on  16  processor. 
Also  consistent  with  our  earlier  findings  the  SplashBP  scheduler  is  able  to  accelerate  convergence.  The 
larger  cost  per  BP  update  in  conjunction  with  the  ability  to  run  a  fully  asynchronous  schedule  enables 
Loopy  BP  to  achieve  relatively  uniform  update  efficiency  compared  to  Gibbs  sampling.  It  is  worth  noting 
that  the  difference  between  the  SplashBP  and  ResidualBP  algorithms  in  the  GraphLab  framework  is  a 
single  command  line  argument  (i.e.,  — scheduler=splash  vs.  — scheduler=priority). 


5.6.3  Named  Entity  Recognition  (NER) 

To  illustrate  how  GraphLab  scales  in  settings  with  large  structured  models  we  designed  and  implemented  a 
parallel  version  of  CoEM  [Jones,  2005,  Nigam  and  Ghani,  2000],  a  semi-supervised  learning  algorithm  for 
named  entity  recognition.  Named  Entity  Recognition  (NER)  is  the  task  of  determining  the  type  (e.g..  Person, 
Place,  or  Thing)  of  a  noun-phrase  (e.g.,  Obama,  Chicago,  or  Car)  from  its  context  (e.g.,  “President 
“lives  near  or  “bought  a  NER  is  used  in  many  natural  language  processing  applications 
as  well  as  information  retrieval.  In  this  application  we  obtained  a  large  crawl  of  the  web  from  the  NELL 
project  [Carlson  et  al.,  2010]  and  constructed  two  NER  datasets  (Table  5.2)  in  which  we  counted  the 
number  of  occurrences  of  each  noun-phrase  in  each  context.  Stalling  with  a  small  seed  set  of  pre-labeled 
noun-phrases,  the  CoEM  algorithm  labels  the  remaining  noun-phrases  and  contexts  (see  Table  5.2b)  by 
alternating  between  estimating  the  type  of  each  noun-phrase  given  the  types  of  its  contexts  and  estimating 
the  type  of  each  context  given  the  types  of  its  noun-phrases. 

The  data  graph  for  the  NER  problem  is  bipartite  with  one  set  of  vertices  corresponding  to  noun-phrases 
and  other  corresponding  to  each  context.  There  is  an  edge  between  a  noun-phrase  and  a  context  if  the 
noun-phrase  occurs  in  the  context.  The  vertices  store  the  estimated  distribution  over  types  and  the  edges 
store  the  number  of  times  the  noun-phrase  appears  in  a  context.  The  CoEM  vertex  program  (Alg.  5.5) 
simply  updates  the  estimated  distribution  for  a  vertex  (either  noun-phrase  or  context)  based  on  the  current 
distribution  for  neighboring  vertices.  We  implemented  a  dynamic  version  of  the  CoEM  algorithm  in 
which  adjacent  vertices  are  rescheduled  if  the  type  distributions  at  a  vertex  changes  by  more  than  some 
predefined  threshold  (10-5).  The  CoEM  vertex  program  is  relatively  fast,  requiring  only  a  few  floating 
point  operations,  and  therefore  stresses  the  GraphLab  implementation  by  requiring  GraphLab  to  manage 
massive  amounts  of  fine-grained  parallelism. 

We  plot  in  Figure  5.9a  and  Figure  5.9b  the  speedup  on  both  small  and  large  datasets  respectively.  On  both 
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Figure  5.8:  MRF  Inference  We  ran  the  Gibbs  sampler  vertex  program  with  three  different  schedules  and 
report  the  runtime  (a),  speedup  (b),  and  work  efficiency  (c).  Consistent  with  the  results  in  Chapter  4  we 
find  that  the  chromatic  schedule  is  the  fastest  and  most  efficient  in  terms  of  raw  sample  generation  rate.  To 
provide  a  basis  for  comparison  we  also  ran  loopy  belief  propagation  on  the  same  problem  evaluating  two 
different  schedulers  and  again  reporting  runtime  (d),  speedup  (e),  and  work  efficiency  (f). 


datasets  we  see  relatively  strong  scaling  performance.  However,  we  see  better  scaling  on  the  larger  dataset 
which  is  expected  since  the  computation  time  for  each  vertex  program  goes  up  with  the  number  of  types 
hiding  much  of  the  GraphLab  system  overhead.  With  16  parallel  processors,  we  could  complete  three  full 
Round-robin  iterations  on  the  large  dataset  in  less  than  30  minutes  (including  data  loading  time).  As  a 
comparison,  a  comparable  Hadoop  implementation1  took  approximately  7.5  hours  to  complete  the  exact 
same  task,  executing  on  an  average  of  95  cpus. 

Using  the  flexibility  of  the  GraphLab  framework  we  were  able  to  study  the  benefits  of  FiFo  scheduling 
versus  the  original  sweep  schedule  proposed  in  the  CoEM.  Figure  5.9c  compares  the  number  of  updates 
required  by  both  schedules  to  obtain  a  result  of  comparable  quality  on  the  larger  dataset.  Here  we  measure 
quality  by  L\  parameter  distance  to  an  empirical  estimate  of  the  fixed  point  x* ,  obtained  by  running  a  large 
number  of  iterations.  For  this  application  we  do  not  find  a  substantial  benefit  from  switching  to  a  FiFo 
schedule. 

We  also  investigated  how  GraphLab  scales  with  problem  size.  Figure  5.9d  shows  the  maximum  speedup 
on  16  cpus  attained  with  varying  graph  sizes,  generated  by  sub-sampling  a  fraction  of  vertices  from  the 
large  dataset.  We  find  that  parallel  scaling  improves  with  problem  size  and  that  even  on  smaller  problems 
GraphLab  is  still  able  to  achieve  a  factor  of  12  speedup  on  16  cores. 

'Personal  communication  with  Justin  Betteridge  and  Tom  Mitchell,  Mar  12,  2010 
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Algorithm  5.5:  CoEM  Vertex  Program  for  Named  Entity  Recognition 
Input:  Vertex  data  irv  £  Sv  the  distribution  for  this  vertex. 

Input:  Vertex  data  {7 tu  :  u  £  J\fv}  £  Sv  the  distributions  for  neighbors. 
Input:  Edge  data  {wu<v  :  u  £  A fv}  €  Sv  weights  based  on  co-occurrence 

//  Update  the  distribution  on  this  vertex 

7T°ld  £-  7 Tv 

7r V  V-  Y2ueAfv  Wu,v'Ku 

//  Determine  which  neighbors  need  to  be  recomputed 
T  ^0 

foreach  u  £  J\fv  do 

if  |7r°w  -  7r^|mUi„  >  e  then 

|_  rtruH 

return  T 


Food 

Religion 

City 

onion 

garlic 

noodles 

blueberries 

beans 

Catholic 

Fremasonry 

Marxism 

Catholic  Chr. 

Humanism 

Munich 
Cape  Twn. 
Seoul 

Mexico  Cty. 
Winnipeg 

Name 

Classes 

Verts. 

Edges 

Runtime 

small 

1 

0.2  mil. 

20  mil. 

40  min 

large 

135 

2  mil. 

200  mil. 

8  hours 

(a)  NER  Datasets  (b)  NER  Types 

Table  5.2:  (a)  Dataset  sizes  for  the  CoEM  benchmark  and  the  corresponding  single  processor  runtime,  (b) 
Top  words  for  several  types. 

5.6.4  Lasso 

The  Lasso  [Tibshirani,  1996]  is  a  popular  feature  selection  and  shrinkage  method  for  linear  regression 
which  minimizes  the  objective  L(w)  =  ^''=]  (wTXj  —  yj)2  +  A  | w 1 1 , .  At  the  time  of  this  work  there  did 
not  exist,  a  parallel  algorithm  for  fitting  a  Lasso  model.  In  this  section  we  easily  implement  two  different 
parallel  algorithms  for  solving  the  Lasso.  We  use  GraphLab  to  implement  the  Shooting  Algorithm  [Fu, 
1998],  a  popular  Lasso  solver,  and  demonstrate  that  GraphLab  is  able  to  automatically  obtain  parallelism 
by  identifying  operations  that  can  execute  concurrently  while  retaining  serializability. 

The  shooting  algorithm  works  by  iteratively  minimizing  the  objective  with  respect  to  each  dimension  in  w, 
corresponding  to  coordinate  descent.  We  can  formulate  the  Shooting  algorithm  in  the  GraphLab  framework 
as  a  bipartite  graph  with  a  vertex  for  each  weight  Wi  and  a  vertex  for  each  observation  yr  An  edge  is 
created  between  Wi  and  y}  with  weight  Xtj  if  and  only  if  AQj  is  non-zero.  We  also  define  a  vertex  program 
(Alg.  5.6)  which  operates  only  on  the  weight  vertices,  and  corresponds  exactly  to  a  single  minimization 
step  in  the  shooting  algorithm.  A  sweep  scheduling  of  Alg.  5.6  on  all  weight  vertices  corresponds  exactly 
to  the  sequential  shooting  algorithm.  We  automatically  obtain  an  equivalent  parallel  algorithm  by  select  the 
full  serializability  model.  Hence,  by  encoding  the  shooting  algorithm  in  GraphLab  we  arc  able  to  discover 
a  serializable  parallelization  based  on  the  sparsity  structure  in  our  data. 

We  evaluate  the  performance  of  the  GraphLab  implementation  on  a  financial  data  set  obtained  from  Kogan 
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(a)  CoEM  Speedup  Small 


(b)  CoEM  Speedup  Large 


(c)  Convergence 


Single  Processor  Runtime  (Seconds) 
(d)  Speedup  with  Problem  Size 


Figure  5.9:  CoEM  Results  (a,b)  Speedup  measured  relative  to  fastest  running  time  on  a  single  cpu  for 
both  datasets.  The  large  dataset  achieves  better  scaling  because  the  increased  number  of  types  increases 
the  computation  time  of  each  vertex  program  and  hides  most  of  the  GraphLab  overhead,  (c)  Speed  of 
convergence  measured  in  number  of  updates  for  FiFo  and  Sweep,  (d)  Speedup  achieved  with  16  cpus  as 
the  graph  size  is  varied. 


et  al.  [2009].  The  task  is  to  use  word  counts  of  a  financial  report  to  predict  stock  volatility  of  the  issuing 
company  for  the  consequent  12  months.  Data  set  consists  of  word  counts  for  30K  reports  with  the  related 
stock  volatility  metrics. 

To  evaluate  the  affect  of  the  various  exclusion  policies  on  the  parallel  scaling  and  algorithm  convergence 
we  create  two  datasets  by  deleting  common  words.  The  sparser  dataset  contains  209K  features  and  1.2M 
non-zero  entries,  and  the  denser  dataset  contains  217K  features  and  3.5M  non-zero  entries.  The  speedup 
curves  are  plotted  in  Figure  5.10.  We  observed  better  scaling  (4x  at  16  CPUs)  on  the  sparser  dataset  than 
on  the  denser  dataset  (2x  at  16  CPUs).  This  demonstrates  that  ensuring  full  serializability  on  denser  graphs 
inevitably  increases  contention  resulting  in  reduced  scaling. 

Additionally,  we  experimented  with  relaxing  the  serializability  constraint  by  switching  to  the  vertex 
exclusion  policy.  We  discovered  that  the  shooting  algorithm  still  converges  when  serializability  is  no 
longer  enforced;  obtaining  solutions  with  only  0.5%  higher  loss  on  the  same  termination  criterion.  The 
vertex  exclusion  model  exposes  substantially  more  parallelism  and  we  therefore  observe  significantly  better 
scaling,  especially  on  the  denser  dataset. 
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Algorithm  5.6:  Shooting  Algorithm 
Minimize  the  loss  function  with  respect  to  Wi 
if  Wi  changed  by  >  e  then 

Revise  the  residuals  on  all  y's  connected  to  iut 
Schedule  all  w's  connected  to  neighboring  -y's 


Algorithm  5.7 :  Compressed  Sensing  Outer  Loop 

while  duality. gap  >  e  do 

Update  edge  and  node  data  of  the  data  graph. 
Use  GraphLab  to  run  GaBP  on  the  graph 
Use  aggregator  to  compute  duality  gap 
Take  a  newton  step 

end 


5.6.5  Compressed  Sensing 

To  show  how  GraphLab  can  be  used  as  a  subcomponent  of  a  larger  sequential  algorithm,  we  implement 
a  variation  of  the  interior  point  algorithm  proposed  by  Kim  et  al.  [2007]  for  the  purposes  of  compressed 
sensing.  The  aim  is  to  use  a  sparse  linear  combination  of  basis  functions  to  represent  an  image,  while 
minimizing  the  reconstruction  error.  Sparsity  is  achieved  through  the  use  of  elastic  net  regularization. 

The  interior  point  method  is  a  double  loop  algorithm  where  the  sequential  outer  loop  (Alg.  5.7)  implements 
a  Newton  method  while  the  inner  loop  computes  the  Newton  step  by  solving  a  sparse  linear  system  using 
GraphLab.  We  used  Gaussian  BP  (GaBP)  as  a  linear  solver  [Bickson,  2008]  since  it  has  a  natural  GraphLab 
representation.  The  GaBP  GraphLab  construction  follows  closely  the  BP  example  in  Section  5.6.1,  but 
represents  potentials  and  messages  analytically  as  Gaussian  distributions.  In  addition,  the  outer  loop  uses  a 
Sync  operation  on  the  data  graph  to  compute  the  duality  gap  and  to  terminate  the  algorithm  when  the  gap 
falls  below  a  predefined  threshold.  Because  the  graph  structure  is  fixed  across  iterations,  we  can  leverage 
data  persistency  in  GraphLab,  avoid  both  costly  set  up  and  tear  down  operations  and  resume  from  the 
converged  state  of  the  previous  iteration. 

We  evaluate  the  performance  of  this  algorithm  on  a  synthetic  compressed  sensing  dataset  constructed  by 
applying  a  random  projection  matrix  to  a  wavelet  transform  of  a  256  x  256  Lenna  image  (Figure  5.11). 
Experimentally,  we  achieved  a  factor  of  8  speedup  using  16  processors  using  sweep  scheduling. 


5.7  Distributed  GraphLab  Design 


In  this  section  we  extend  the  shared  memory  design  of  the  GraphLab  system  to  the  substantially  more 
challenging  distributed  setting  and  discuss  the  techniques  required  to  achieve  this  goal.  An  overview  of 
the  distributed  design  is  illustrated  in  Figure  5.15a.  Because  of  the  inherently  random  memory  access 
patterns  common  to  dynamic  asynchronous  graph  algorithms,  we  focus  on  the  distributed  in-memory 
setting,  requiring  the  entire  graph  and  all  program  state  to  reside  in  RAM. 
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(c)  Sparser  Dataset  Runtime 


(d)  Denser  Dataset  Runtime 


Figure  5.10:  Shooting  Algorithm  (a)  Speedup  on  the  sparser  dataset  using  vertex  exclusion  and  scope 
exclusion  (i.e.,  serializability)  relative  to  the  fastest  single  processor  runtime,  (b)  Speedup  on  the  denser 
dataset  using  vertex  exclusion  and  scope  exclusion  relative  to  the  fastest  single  processor  runtime. 


We  first  present  our  distributed  data  graph  representation  that  allows  for  rapid  repartitioning  across  different 
loads  cluster  sizes.  Next,  we  describe  two  versions  of  the  GraphLab  engine  for  the  distributed  setting.  The 
first  engine  is  the  simpler  chromatic  engine  (Section  5.7.2)  that  uses  basic  graph  coloring  to  achieve  the 
various  exclusion  policies.  The  second  is  a  locking  engine  (Section  5.7.2)  that  directly  extends  the  lock 
based  implementation  of  the  exclusion  policies  to  the  distributed  setting. 


5.7.1  The  Distributed  Data  Graph 

Efficiently  implementing  the  data  graph  in  the  distributed  setting  requires  balancing  computation,  commu¬ 
nication,  and  storage.  Therefore,  we  need  to  construct  balanced  partitioning  of  the  data  graph  that  minimize 
number  of  edges  that  cross  between  machines.  Because  the  Cloud  setting  enables  the  size  of  the  cluster  to 
vary  with  budget  and  performance  demands,  we  must  be  able  to  quickly  load  the  data-graph  on  varying 
sized  Cloud  deployments.  To  resolve  these  challenges,  we  developed  a  graph  representation  based  on 
two-phased  partitioning  which  can  be  efficiently  load  balanced  on  a  range  of  cluster  sizes. 

Motivated  by  our  work  on  over  partitioning  for  distributed  belief  propagation,  the  data  graph  is  initially 
over-partitioned  using  domain  specific  knowledge  (e.g.,  planar  embedding),  or  by  using  a  distributed  graph 
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16 


(a)  Speedup  (b)  Lenna  (c)  Lenna  50% 

Figure  5.11:  (a)  Speedup  of  the  Interior  Point  algorithm  on  the  compressed  sensing  dataset,  (b)  Original 
256x256  test  image  with  65,536  pixels,  (c)  Output  of  compressed  sensing  algorithm  using  32,768  random 
projections. 


Atom  1  (Weight  =  4) 


(c)  Meta-Graph 


Figure  5.12:  Partitions  and  Ghosts.  To  represent  a  distributed  graph  we  partition  the  graph  into  files  each 
containing  fragments  of  the  original  graph.  In  (a)  we  cut  a  graph  into  two  (unbalanced)  pieces.  In  (b)  we 
illustrate  the  ghosting  process  for  the  resulting  two  atoms.  The  edges,  vertices,  and  data  illustrated  with 
broken  edges  are  ghosts  in  the  respective  atoms.  The  meta-graph  (c)  is  a  weighted  graph  of  two  vertices. 


partitioning  heuristic  (e.g.,  ParMetis  [Karypis  and  Kumar,  1998],  Random  Hashing)  into  k  parts  where  k 
is  much  greater  than  the  number  of  machines.  Each  part,  called  an  atom,  is  stored  as  a  separate  file  on  a 
distributed  storage  system  (e.g.,  HDFS,  Amazon  S3).  Each  atom  file  contains  a  subset  of  edges  and  vertices 
as  well  as  the  corresponding  metadata.  In  addition,  each  atom  stores  information  regarding  ghosts:  the  set 
of  vertices  and  edges  adjacent  to  the  partition  boundary.  The  connectivity  structure  and  file  locations  of  the 
k  atoms  is  stored  in  a  atom  index  file  as  a  meta-graph  with  k  vertices  (corresponding  to  the  atoms)  and 
edges  encoding  the  connectivity  of  the  atoms. 

Distributed  loading  is  accomplished  by  performing  a  fast  balanced  partition  of  the  meta-graph  over  the 
number  of  physical  machines.  Each  machine  then  constructs  its  local  portion  of  the  graph  by  playing  back 
the  journal  from  each  of  its  assigned  atoms.  The  playback  procedure  also  instantiates  the  ghost  of  the  local 
partition  in  memory.  The  ghosts  are  used  as  caches  for  their  true  counterparts  across  the  network.  Cache 
coherence  is  managed  using  a  simple  versioning  system,  eliminating  the  transmission  of  unchanged  or 
constant  data  (e.g.,  edge  weights). 

The  two-stage  partitioning  technique  allows  the  same  graph  partition  computation  to  be  reused  for  different 
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numbers  of  machines  without  requiring  a  full  repartitioning  step.  Furthermore,  random  or  even  adaptive 
placement  of  the  atoms  can  be  used  in  situations  where  load  balancing  is  needed  (see  Section  3.8.1).  We 
evaluated  two  stage  partitioning  on  a  wide  range  of  graphs  obtained  from  Leskovec  [2011]  and  found  that 
if  the  initial  atom  graph  is  well  partitioned  then  the  two  phase  partitioning  performs  comparably.  This  is  to 
be  expected  since  typical  graph  partitioning  algorithms  perform  a  similar-  coarsening  procedure. 

The  use  of  ghost  vertices/edges  in  this  design  also  introduces  a  limitation:  the  total  amount  of  memory 
required  by  a  single  machine  is  not  simply  ^  of  the  graph,  but  is  a  function  of  the  graph  structure.  Dense 
graphs  (e.g.,  a  clique)  could  require  every  machine  to  store  a  copy  of  the  complete  graph.  However,  such 
dense  graphs  would  violate  the  GrAD  methodology  and  are  not  common  in  large-scale  problems. 

In  our  later  experiments  we  found  that  certain  graphs  were  too  difficult  for  Metis/ParMetis  to  cut  in 
reasonable  time  and  memory.  In  these  cases  we  resorted  to  random  partitioning  to  provide  the  initial  atoms. 
In  the  next  chapter  we  will  return  to  this  problem  and  show  that  many  real-world  graphs,  although  sparse, 
have  a  particular  sparsity  pattern  that  will  ultimately  lead  to  costly  replication  and  frustrate  any  edge  cut 
based  partitioning  technique. 


5.7.2  Distributed  GraphLab  Engines 

The  Distributed  GraphLab  engine  emulates  the  execution  model  defined  in  Section  5.4.3  and  is  responsible 
for  executing  vertex  programs  and  sync  operations,  maintaining  the  set  of  scheduled  vertices  T,  and 
ensuring  serializability  with  respect  to  the  appropriate  exclusion  policy  (see  Section  5.4.5).  As  discussed 
earlier,  the  precise  order  in  which  vertices  are  removed  from  T  is  up  to  the  execution  engine  and  scheduler 
and  can  affect  performance  and  expressiveness.  To  evaluate  this  trade-off  we  built  the  low-overhead 
Chromatic  Engine,  which  executes  T  partially  asynchronously  according  to  a  fixed  sweep  schedule,  and 
the  more  expressive  Locking  Engine  which  is  fully  asynchronous  and  supports  relaxed  versions  of  the 
FiFo  and  priority  schedulers. 


Chromatic  Engine 

In  our  work  on  parallel  Gibbs  sampling  we  introduced  the  Chromatic  Gibbs  sampler  (Section  4.3)  which 
uses  graph  color  to  achieve  a  serializable  execution  without  the  use  of  fine  grained  locking.  The  Chromatic 
GraphLab  engine  extends  this  classic  technique  to  achieve  a  serializable  parallel  execution  of  vertex 
programs.  Given  a  vertex  coloring  of  the  data  graph,  we  can  satisfy  the  edge  exclusion  policy  by  executing, 
synchronously,  all  vertices  of  the  same  color  in  the  vertex  set  T  before  proceeding  to  the  next  color.  We 
use  the  term  color-step,  in  analogy  to  the  super-step  in  the  BSP  model,  to  describe  the  process  of  updating 
all  the  vertices  within  a  single  color  and  communicating  all  changes.  The  sync  operation  can  then  be  run 
safely  between  color-steps. 

We  can  satisfy  each  of  the  exclusion  policies  simply  by  changing  how  the  vertices  are  colored.  The  scope 
exclusion  policy  is  satisfied  by  constructing  a  second-order  vertex  coloring  (i.e.,  no  vertex  shares  the  same 
color  as  any  of  its  distance  two  neighbors).  The  vertex  exclusion  policy  is  satisfied  by  assigning  all  vertices 
the  same  color.  While  optimal  graph  coloring  is  NP-hard  in  general,  a  reasonable  quality  coloring  can  be 
constructed  quickly  using  graph  coloring  heuristics  (e.g.,  greedy  coloring).  Furthermore,  many  MLDM 
problems  produce  graphs  with  trivial  colorings.  For  example,  many  optimization  problems  in  MLDM  are 
naturally  expressed  as  bipartite  (two-colorable)  graphs,  while  problems  based  upon  template  models  can  be 
easily  colored  using  the  template  [Gonzalez  et  al.,  2011]. 
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Algorithm  5.8:  Pipelined  Locking  Engine  Thread  Loop 
while  not  done  do 

while  Lock  acquisition  pipeline  is  not  full  and  the  scheduler  is  not  empty  do 
Acquire  a  vertex  v  from  scheduler 
Add  task  to  lock  acquisition  pipeline 

if  Lock  acquisition  pipeline  has  ready  vertex  v  then 
Execute  ( T',SV )  =  f(v,Sy) 

/ /  update  scheduler  on  each  machine 

For  each  machine  p,  Send  {s  €  T'  :  owner (s)  =  p) 

Release  locks  and  push  changes  to  Sv  in  background 

else 

Wait  on  the  lock  acquisition  pipeline 


While  the  chromatic  engine  operates  in  synchronous  color-steps,  changes  to  ghost  vertices  and  edges  arc 
communicated  asynchronously  as  they  are  made.  Consequently,  the  chromatic  engine  efficiently  uses 
both  network  bandwidth  and  processor  time  within  each  color-step.  However,  we  must  ensure  that  all 
modifications  are  communicated  before  moving  to  the  next  color  and  therefore  we  require  a  synchronization 
barrier  between  color-steps. 

The  simple  design  of  the  Chromatic  engine  is  made  possible  by  the  explicit  communication  structure  defined 
by  the  data  graph,  allowing  data  to  be  pushed  directly  to  the  machines  requiring  the  information.  In  addition, 
the  cache  versioning  mechanism  further  optimizes  communication  by  only  transmitting  modified  data.  The 
advantage  of  the  Chromatic  engine  lies  its  predictable  execution  schedule.  Repeated  invocations  of  the 
chromatic  engine  will  always  produce  identical  execution  sequences,  regardless  of  the  number  of  machines 
used.  This  property  makes  the  Chromatic  engine  highly  suitable  for  testing  and  debugging  purposes.  The 
global  barriers  at  the  end  of  each  color  provide  a  natural  breakpoint  for  debugging  capabilities.  We  provide 
a  distributed  debugging  tool  that  halts  at  the  end  of  each  color,  allowing  graph  data  and  scheduler  state  to 
be  queried. 

The  Chromatic  engine  supports  limited  dynamic  scheduling.  On  each  iteration,  only  active  vertices  are 
executed  within  each-color  step  and  vertices  can  only  activate  vertices  that  are  a  different  color.  Moreover, 
the  chromatic  engine  exactly  preserves  the  semantics  of  the  sweep  schedule  where  vertices  are  ordered 
by  color.  As  a  consequence  while  the  execution  may  be  composed  of  synchronous  sub-steps  it  actually 
corresponds  to  a  fixed  serial  execution  on  the  graph. 


Distributed  Locking  Engine 

While  the  Chromatic  engine  satisfies  the  GraphLab  execution  model  and  is  capable  of  expressing  dynamic 
computation,  it  does  not  provide  sufficient  scheduling  flexibility  (e.g.,  FiFo  and  Priority)  for  algorithms 
like  SplashBP  and  Splash  Gibbs  sampling.  To  overcome  these  limitations,  we  introduce  the  distributed 
locking  engine  which  extends  the  mutual  exclusion  technique  used  in  the  shared  memory  engine  to  the 
distributed  setting. 

Since  the  graph  is  partitioned,  each  machine  is  only  responsible  for  scheduling  and  executing  vertices  that 
have  been  assigned  to  that  machine.  This  enables  each  machine  to  run  one  of  the  many  shared-memory 
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(a)  Runtime  (b)  Pipeline  Length 


Figure  5.13:  (a)  Plots  the  runtime  of  the  Distributed  Locking  Engine  on  a  synthetic  loopy  belief  propagation 
problem  varying  the  number  of  machines  with  pipeline  length  =  10,  000.  (b)  Plots  the  runtime  of  the 
Distributed  Locking  Engine  on  the  same  synthetic  problem  on  16  machines  (128  CPUs),  varying  the 
pipeline  length.  Increasing  pipeline  length  improves  performance  with  diminishing  returns. 


schedulers  allowing  a  much  greater  range  of  dynamic  schedules.  The  ghost  vertices  and  edges  ensure 
that  the  update  have  direct  memory  access  to  all  information  in  the  scope.  The  ghosting  system  also 
eliminates  the  need  to  transmit  or  wait  on  data  that  has  not  changed  remotely  reducing  network  traffic  and 
latency.  Each  worker  thread  on  each  machine  evaluates  the  loop  described  in  Alg.  5.8  until  the  scheduler  is 
empty. 

To  hide  the  latency  of  distributed  locking  we  pipeline  the  locking  processess.  When  a  vertex  is  removed 
from  the  scheduler  its  locks  are  requested  and  it  is  placed  on  the  pipeline  freeing  the  processor  to  execute 
other  vertex  programs  or  assist  in  lock  processing.  Once  all  the  locks  for  a  vertex  have  been  acquired  from 
remote  machines  the  vertex  is  removed  from  the  pipeline  and  executed.  As  locks  are  acquired  the  local 
ghost  information  is  updated. 

To  evaluate  the  performance  of  the  distributed  pipelining  system,  we  constructed  a  three-dimensional 
mesh  of  300  x  300  x  300  =  27,  000, 000  vertices.  Each  vertex  is  26-connected  (to  immediately  adjacent 
vertices  along  the  axis  directions,  as  well  as  all  diagonals),  producing  over  375  million  edges.  The  graph 
is  partitioned  using  Metis  [Karypis  and  Kumar,  1998]  into  512  atoms.  We  interpret  the  graph  as  a  binary 
Markov  Random  Field  and  evaluate  the  runtime  of  loopy  belief  propagation  varying  the  length  of  the 
pipeline  from  100  to  10,000,  and  the  number  of  EC2  cluster  compute  instance  (ccl .  4xlarge)  from  4 
machines  (32  processors)  to  16  machines  (128  processors).  We  observe  in  Figure  5.13a  that  the  distributed 
locking  system  provides  strong,  nearly  linear,  scalability.  In  Figure  5.13b  we  evaluate  the  efficacy  of  the 
pipelining  system  by  increasing  the  pipeline  length.  We  find  that  increasing  the  length  from  100  to  1000 
leads  to  a  factor  of  three  reduction  in  runtime. 


5.7.3  Fault  Tolerance 

We  achieve  fault  tolerance  in  the  distributed  GraphLab  framework  using  a  distributed  checkpoint  mechanism. 
In  the  event  of  a  failure,  the  system  is  recovered  from  the  last  checkpoint.  We  evaluate  two  strategies  to 
construct  distributed  snapshots:  a  synchronous  method  that  suspends  all  computation  while  the  snapshot  is 
constructed,  and  an  asynchronous  method  that  incrementally  constructs  a  snapshot  without  suspending 
execution. 
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Algorithm  5.9:  Snapshot  Vertex  Program  on  vertex  v 
if  v  was  already  snapshotted  then 

L  Quit 

Save  Dv  / /  Save  current  vertex 

foreach  u  G  J\fv  do  //  Loop  over  neighbors 

if  n  was  not  snapshotted  then 
Save  data  on  edge  Du^v 
Schedule  u  for  a  Snapshot  Update 

Mark  v  as  snapshotted 


The  synchronous  snapshots  are  easily  constructed  by  suspending  execution  of  vertex  programs,  flushing  all 
communication  channels,  and  then  saving  all  modified  data  since  the  last  snapshot.  If  a  machine  fails  we 
simply  restart  the  system  with  the  data  graph  at  the  last  checkout-point.  Interestingly,  we  implemented  the 
synchronous  snapshot  as  a  GraphLab  aggregate  serializability  enabled. 

Unfortunately,  synchronous  snapshots  expose  the  GraphLab  engine  to  the  same  inefficiencies  of  syn¬ 
chronous  computation  that  GraphLab  is  trying  to  address.  Using  the  GraphLab  abstraction  we  designed  and 
implemented  a  variant  of  the  Chandy-Lamport  [Chandy  and  Lamport,  1985]  snapshot  specifically  tailored 
to  the  GraphLab  data-graph  and  execution  model.  The  resulting  algorithm  (Alg.  5.9)  can  be  implemented 
as  a  vertex  program  and  guarantees  a  consistent  snapshot.  The  proof  of  correctness  follows  naturally  from 
the  original  proof  in  Chandy  and  Lamport  [1985]  with  the  machines  and  channels  replaced  by  vertices  and 
edges  and  messages  corresponding  to  scope  modifications. 


Both  the  synchronous  and  asynchronous  snapshots  are  initiated  at  fixed  intervals.  The  choice  of  interval 
must  balance  the  cost  of  constructing  the  checkpoint  with  the  computation  lost  since  the  last  checkpoint 
in  the  event  of  a  failure.  Young  et  al.  [Young,  1974]  derived  a  first-order  approximation  to  the  optimal 
checkpoint  interval: 


^Interval  —  \J  2Tcheckpoint2~MTBF 


(5.6) 


where  Tcheckpoint  is  the  time  it  takes  to  complete  the  checkpoint  and  Imtbf  is  the  mean  time  between 
failures  for  the  cluster.  For  instance,  using  a  cluster  of  64  machines,  a  per  machine  MTBF  of  1  year,  and  a 
checkpoint  time  of  2  min  leads  to  optimal  checkpoint  intervals  of  3  hrs.  Therefore,  for  the  deployments 
considered  in  our  experiments,  even  taking  pessimistic  assumptions  for  Tmtbf*  leads  to  checkpoint  intervals 
that  far  exceed  the  runtime  of  our  experiments  and  in  fact  also  exceed  the  Hadoop  experiment  runtimes. 
This  brings  into  question  the  emphasis  on  strong  fault  tolerance  in  Hadoop.  Better  performance  can  be 
obtained  by  balancing  fault  tolerance  costs  against  that  of  a  job  restart. 


We  evaluate  the  performance  of  the  snapshotting  algorithms  on  the  same  synthetic  mesh  problem  described 
in  the  previous  section,  running  on  16  machines  (128  processors).  We  configure  the  implementation  to 
issue  exactly  one  snapshot  in  the  middle  of  the  second  iteration.  In  Figure  5.14a  we  plot  the  number  of 
updates  completed  against  time  elapsed.  The  effect  of  the  synchronous  snapshot  and  the  asynchronous 
snapshot  can  be  clearly  observed:  synchronous  snapshots  stops  execution,  while  the  asynchronous  snapshot 
only  slows  down  execution. 


The  benefits  of  asynchronous  snapshots  become  more  apparent  in  the  multi-tenancy  setting  where  variation 
in  system  performance  exacerbate  the  cost  of  synchronous  operations.  We  simulate  this  on  Amazon  EC2 
by  halting  one  of  the  processes  for  15  seconds  after  snapshot  begins.  In  figures  Figure  5.14b  we  again  plot 
the  number  of  updates  completed  against  time  elapsed  and  we  observe  that  the  asynchronous  snapshot  is 
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(a)  Snapshot 


(b)  Snapshot  with  Delay 


Figure  5.14:  (a)  The  number  of  vertices  updated  vs.  time  elapsed  for  10  iterations  comparing  asynchronous 
and  synchronous  snapshots.  Synchronous  snapshots  (completed  in  109  seconds)  have  the  characteristic 
“flatline”  while  asynchronous  snapshots  (completed  in  104  seconds)  allow  computation  to  proceed,  (b) 
Same  setup  as  in  (a)  but  with  a  single  machine  fault  lasting  15  seconds.  As  a  result  of  the  15  second  delay 
the  asynchronous  snapshot  incurs  only  a  3  second  penalty  while  the  synchronous  snapshot  incurs  a  1 6 
second  penalty. 


minimally  affected  by  the  simulated  failure  (adding  only  3  seconds  to  the  runtime),  while  the  synchronous 
snapshot  experiences  a  full  15  second  increase  in  runtime. 


5.7.4  System  Design 

In  Figure  5.15,  we  provide  a  high-level  overview  of  the  design  of  the  distributed  GraphLab  system.  The 
user  begins  by  constructing  the  atom  graph  representation  on  a  Distributed  File  System  (DFS).  When 
GraphLab  is  launched  on  a  cluster,  one  instance  of  the  GraphLab  program  is  executed  on  each  machine.  The 
GraphLab  processes  are  symmetric  and  directly  communicate  with  each  other  using  a  custom  asynchronous 
RPC  protocol  over  TCP/IP.  The  first  process  has  the  additional  responsibility  of  being  a  master/monitoring 
machine. 

At  launch  the  master  process  computes  the  placement  of  the  atoms  based  on  the  atom  index,  following 
which  all  processes  perform  a  parallel  load  of  the  atoms  they  were  assigned.  Each  process  is  responsible  for 
a  partition  of  the  distributed  graph  that  is  managed  within  a  local  graph  storage,  and  provides  distributed 
locks.  A  cache  is  used  to  provide  access  to  remote  graph  data. 

Each  process  also  contains  a  scheduler  that  manages  the  vertices  in  T  that  have  been  assigned  to  the  process. 
At  runtime,  each  machine’s  local  scheduler  feeds  vertices  into  a  prefetch  pipeline  that  collects  the  data  and 
locks  required  to  execute  the  vertex.  Once  all  data  and  locks  have  been  acquired,  the  vertex  is  executed  by 
a  pool  of  worker  threads.  Vertex  scheduling  is  decentralized  with  each  machine  managing  the  schedule  for 
its  local  vertices  and  forwarding  scheduling  requests  for  remote  vertices.  Finally,  a  distributed  consensus 
algorithm  [Misra,  1983]  is  used  to  determine  when  all  schedulers  arc  empty.  Due  to  the  symmetric  design 
of  the  distributed  runtime,  there  is  no  centralized  bottleneck. 
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(a)  System  Overview 


(b)  Locking  Engine  Design 


Figure  5.15:  (a)  A  high  level  overview  of  the  GraphLab  system.  In  the  initialization  phase  the  atom  file 
representation  of  the  data  graph  is  constructed.  In  the  GraphLab  Execution  phase  the  atom  files  are  assigned 
to  individual  execution  engines  and  are  then  loaded  from  the  DFS.  (b)  A  block  diagram  of  the  parts  of  the 
Distributed  GraphLab  process.  Each  block  in  the  diagram  makes  use  of  the  blocks  below  it.  For  more 
details,  see  Section  5.7.4. 


5.8  Distributed  GraphLab  Evaluation 


We  evaluated  the  distributed  implementation  of  GraphLab  on  three  state-of-the-art  MLDM  applications: 
collaborative  filtering  for  Netflix  movie  recommendations,  Video  Co-segmentation  (CoSeg)  and  Named 
Entity  Recognition  (NER).  Each  experiment  was  based  on  large  real-world  problems  and  datasets  (see 
Table  5.3).  We  used  the  Chromatic  engine  for  the  Netflix  and  NER  applications  and  the  Locking  Engine  for 
the  CoSeg  application.  Equivalent  Fladoop  and  MPI  implementations  were  also  evaluated  on  the  Netflix 
and  NER  applications.  All  timing  experiments  include  loading  time  and  atom  graph  partitioning. 

Experiments  were  performed  on  Amazon's  Elastic  Computing  Cloud  (EC2)  using  up  to  64  Fligh- 
Performance  Cluster  (HPC)  instances  (ccl .  4xlarge)  each  with  dual  Intel  Xeon  X5570  quad-core 
Nehalem  processors  and  22  GB  of  memory  and  connected  by  a  10  Gigabit  Ethernet  network.  All  timings 
include  data  loading  and  are  averaged  over  three  or  more  runs.  On  each  node,  GraphLab  spawns  eight 
engine  threads  (matching  the  number  of  cores). 

In  Figure  5.16a  we  present  an  aggregate  summary  of  the  parallel  speedup  of  GraphLab  when  run  on 
4  to  64  FIPC  machines  on  all  three  applications.  In  all  cases,  speedup  is  measured  relative  to  the  four 
node  deployment  since  single  node  experiments  were  not  always  feasible  due  to  memory  limitations.  No 
snapshots  were  constructed  during  the  timing  experiments  since  all  experiments  completed  prior  to  the  first 
snapshot  under  the  optimal  snapshot  interval  (3  hours)  as  computed  in  Section  5.7.3.  To  provide  intuition 
regarding  the  snapshot  cost,  in  Figure  5.18d  we  plot  for  each  application,  the  overhead  of  compiling  a 
snapshot  on  a  64  machine  cluster. 

Our  principle  findings  arc : 

•  On  equivalent  tasks,  GraphLab  outperforms  Hadoop  by  20-60x  and  performance  is  comparable  to 
tailored  MPI  implementations. 

•  GraphLab’s  performance  scaling  improves  with  higher  computation  to  communication  ratios. 

•  The  GraphLab  abstraction  more  compactly  expresses  the  Netflix,  NER  and  Coseg  algorithms  than 
MapReduce  or  MPI. 
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#Machines 

(b)  Overall  Network  Utilization 


(c)  Netflix  Scaling  with  Intensity 


(d)  Netflix  Comparisons 


Figure  5.16:  (a)  Scalability  of  the  three  test  applications  with  the  largest  input  size.  CoSeg  scales  excellently 
due  to  very  sparse  graph  and  high  computational  intensity.  Netflix  with  default  input  size  scales  moderately 
while  NER  is  hindered  by  high  network  utilization,  (b)  Average  bandwidth  utilization  per  cluster  node. 
Netflix  and  CoSeg  have  very  low  bandwidth  requirements  while  NER  appears  to  saturate  when  #machines 
>  24.  (c)  Scalability  of  Netflix  varying  the  computation  cost  of  the  vertex  program,  (d)  Runtime  of  Netflix 
with  GraphLab,  Hadoop  and  MPI  implementations.  Note  the  logarithmic  scale.  GraphLab  outperforms 
Hadoop  by  40-60x  and  is  comparable  to  an  MPI  implementation.  See  Section  5.8.1  and  Section  5.8.3  for  a 
detailed  discussion. 


5.8.1  Netflix  Movie  Recommendation 


The  Netflix  movie  recommendation  task  uses  collaborative  filtering  to  predict  the  movie  ratings  for  each 
user,  based  on  the  ratings  of  similar  users.  We  implemented  the  alternating  least  squares  (ALS)  algorithm 
Zhou  et  al.  [2008],  a  common  algorithm  in  collaborative  filtering.  The  input  to  ALS  is  a  sparse  users  by 
movies  matrix  R,  containing  the  movie  ratings  of  each  user.  The  algorithm  iteratively  computes  a  low-rank 
matrix  factorization: 
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where  U  and  V  are  rank  d  matrices.  The  ALS  algorithm  alternates  between  computing  the  least-squares 
solution  for  U  and  V  while  holding  the  other  fixed.  Both  the  quality  of  the  approximation  and  the 


117 


computational  complexity  depend  on  the  magnitude  of  d:  higher  d  produces  higher  accuracy  while 
increasing  computational  cost.  Collaborative  filtering  and  the  ALS  algorithm  are  important  tools  in  MLDM: 
an  effective  solution  for  ALS  can  be  extended  to  a  broad  class  of  other  applications. 

While  ALS  may  not  seem  like  a  graph  algorithm,  it  can  be  represented  easily  using  the  GraphLab  abstraction. 
The  sparse  matrix  R  defines  a  bipartite  graph  connecting  each  user  with  the  movies  they  rated.  The  edge  data 
contains  the  rating  for  a  movie-user  pair.  The  vertex  data  for  users  and  movies  contains  the  corresponding 
row  in  U  and  column  in  V  respectively.  The  GraphLab  vertex  program  recomputes  the  d  length  vector  for 
each  vertex  by  reading  the  d  length  vectors  on  adjacent  vertices  and  then  solving  a  least-squares  regression 
problem  to  predict  the  edge  values.  The  chromatic  engine  is  used  since  the  graph  is  bipartite,  two  colorable, 
and  the  edge  serializability  model  is  sufficient  for  serializability. 

The  Netflix  task  provides  us  with  an  opportunity  to  quantify  the  distributed  chromatic  engine  overhead  since 
we  are  able  to  directly  control  the  computation-communication  ratio  by  manipulating  d:  the  dimensionality 
of  the  approximating  matrices  in  Eq.  (5.7).  In  Figure  5.16c  we  plot  the  speedup  achieved  for  varying  values 
of  d  and  the  corresponding  number  of  cycles  required  per  update.  Extrapolating  to  obtain  the  theoretically 
optimal  runtime,  we  estimated  the  overhead  of  Distributed  GraphLab  at  64  machines  (512  CPUs)  to  be 
about  12x  for  d  =  5  and  about  4.9x  for  d  =  100.  Note  that  this  overhead  includes  graph  loading  and 
communication.  This  provides  us  with  a  measurable  objective  for  future  optimizations. 

Next,  we  compare  against  a  Hadoop  and  an  MPI  implementation  in  Figure  5.16d  (d  =  20  for  all  cases), 
using  between  4  to  64  machines.  The  Hadoop  implementation  is  part  of  the  Mahout  2  project  and  is  widely 
used.  Since  fault  tolerance  was  not  needed  during  our  experiments,  we  reduced  the  Hadoop  Distributed 
Filesystem’s  (HDFS)  replication  factor  to  one.  A  significant  amount  of  our  effort  was  then  spent  tuning 
the  Hadoop  job  parameters  to  improve  performance.  However,  even  so  we  find  that  GraphLab  performs 
between  40-60  times  faster  than  Hadoop. 

While  some  of  the  Hadoop  inefficiency  may  be  attributed  to  Java,  job  scheduling,  and  various  design 
decisions,  GraphLab  also  leads  to  a  more  efficient  representation  of  the  underlying  algorithm.  We  can 
observe  that  the  Map  function  of  a  Hadoop  ALS  implementation,  performs  no  computation  and  its  only 
puipose  is  to  emit  copies  of  the  vertex  data  for  every  edge  in  the  graph;  unnecessarily  multiplying  the 
amount  of  data  that  need  to  be  tracked.  For  example,  a  user  vertex  that  connects  to  100  movies  must  emit 
the  data  on  the  user  vertex  100  times,  once  for  each  movie.  This  results  in  the  generation  of  a  large  amount 
of  unnecessary  network  traffic  and  unnecessary  HDFS  writes. 

This  weakness  extends  beyond  the  MapReduce  abstraction  and  affects  the  graph  message -passing  models 
(such  as  Pregel)  due  to  the  lack  of  a  scatter  operation  that  would  avoid  sending  same  value  multiple  times 
to  each  machine.  Comparatively,  the  GraphLab  vertex  program  is  simpler  as  users  do  not  need  to  explicitly 
define  the  flow  of  information.  Synchronization  of  a  modified  vertex  only  requires  as  much  communication 
as  there  arc  ghosts  of  the  vertex.  In  particular,  only  machines  that  require  the  vertex  data  for  computation 
will  receive  it,  and  each  machine  receives  each  modified  vertex  data  at  most  once,  even  if  the  vertex  has 
many  neighbors. 

Our  MPI  implementation  of  ALS  is  relatively  optimized,  and  uses  synchronous  MPI  collective  operations 
for  communication.  The  computation  is  broken  into  super-steps  that  alternate  between  recomputing  the 
latent  user  and  movies  low  rank  matrices.  Between  super-steps  the  new  user  and  movie  values  are  scattered 
(using  MPLAlltoall)  to  the  machines  that  need  them  in  the  next  super-step.  As  a  consequence  our  MPI 
implementation  of  ALS  is  roughly  equivalent  to  an  optimized  Pregel  version  of  ALS  with  added  support  for 

2https://issues.  apache.  org/jira/browse/MAHOUT-542 
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parallel  broadcasts.  Surprisingly,  GraphLab  was  able  to  outperform  the  MPI  implementation.  We  attribute 
the  performance  to  the  use  of  background  asynchronous  communication  in  GraphLab. 

Finally,  we  can  evaluate  the  effect  of  enabling  dynamic  computation.  In  Figure  5.19a,  we  plot  the  test  error 
obtained  over  time  using  a  dynamic  update  schedule  as  compared  to  a  static  BSP-style  update  schedule. 
This  dynamic  schedule  is  easily  represented  in  GraphLab  while  it  is  difficult  to  express  using  Pregel 
messaging  semantics.  We  observe  that  a  dynamic  schedule  converges  much  faster,  reaching  a  low  test  error 
using  less  than  half  the  amount  of  work. 


5.8.2  Video  Co-segmentation  (CoSeg) 

Video  co-segmentation  automatically  identifies  and  clusters  spatio-temporal  segments  of  video  (Fig¬ 
ure  5.17a)  that  share  similar  texture  and  color  characteristics.  The  resulting  segmentation  (Figure  5.17a) 
can  be  used  in  scene  understanding  and  other  computer  vision  and  robotics  applications.  Previous  co¬ 
segmentation  methods  Batra  et  al.  [2010]  have  focused  on  processing  frames  in  isolation.  Instead,  we 
developed  a  joint  co-segmentation  algorithm  that  processes  all  frames  concurrently  and  is  able  to  model 
temporal  stability. 

We  preprocessed  1,740  frames  of  high-resolution  video  by  coarsening  each  frame  to  a  regular  grid  of 
120  x  50  rectangular  super-pixels.  Each  super-pixel  stores  the  color  and  texture  statistics  for  all  the  raw 
pixels  in  its  domain.  The  CoSeg  algorithm  predicts  the  best  label  (e.g.,  sky,  building,  grass,  pavement, 
trees)  for  each  super  pixel  using  Gaussian  Mixture  Model  (GMM)  in  conjunction  with  belief  propagation 
(Chapter  3).  The  GMM  estimates  the  best  label  given  the  color  and  texture  statistics  in  the  super-pixel. 
The  algorithm  operates  by  connecting  neighboring  pixels  in  time  and  space  into  a  large  three-dimensional 
grid  and  uses  belief  propagation  to  smooth  the  local  estimates.  We  combined  the  two  algorithms  to  form 
an  asynchronous  Expectation-Maximization  algorithm,  which  concurrently  estimates  the  label  for  each 
super-pixel  given  the  GMM  and  updates  the  GMM  given  the  labels  from  belief  propagation. 

The  GraphLab  vertex  program  executes  the  belief  propagation  iterative  updates.  We  implement  the  state- 
of-the-art  adaptive  belief  residual  schedule  described  in  Chapter  3,  where  updates  that  are  expected  to 
change  vertex  values  significantly  are  prioritized.  We  therefore  make  use  of  the  locking  engine  with  a 
priority  scheduler.  The  parameters  for  the  GMM  arc  maintained  using  the  sync  operation.  To  the  best  of 
our  knowledge,  there  are  no  other  abstractions  that  provide  the  dynamic  asynchronous  scheduling  as  well 
as  the  sync  (reduction)  capabilities  required  by  this  application. 

In  Figure  5.16a  we  demonstrate  that  the  locking  engine  can  achieve  scalability  and  performance  on  the 
large  10.5  million  vertex  graph  used  by  this  application,  resulting  in  a  lOx  speedup  with  16x  more  machines. 
We  also  observe  from  Figure  5.18a  that  the  locking  engine  provides  nearly  optimal  weak  scaling:  the 
runtime  does  not  increase  significantly  as  the  size  of  the  graph  increases  proportionately  with  the  number  of 
machines.  We  can  attribute  this  to  the  properties  of  the  graph  partition  where  the  number  of  edges  crossing 
machines  increases  linearly  with  the  number  of  machines,  resulting  in  low  communication  volume. 

While  Section  5.7.2  contains  a  limited  evaluation  of  the  pipelining  system  on  a  synthetic  graph,  here  we 
further  investigate  the  behavior  of  the  distributed  lock  implementation  when  run  on  a  complete  problem  that 
makes  use  of  all  key  aspects  of  GraphLab:  both  sync  and  dynamic  prioritized  scheduling.  The  evaluation  is 
performed  on  a  small  32-frame  (192K  vertices)  problem  using  a  4  node  cluster  and  two  different  partitioning. 
An  optimal  partition  was  constructed  by  evenly  distributing  8  frame  blocks  to  each  machine.  A  worst  case 
partition  was  constructed  by  striping  frames  across  machines  and  consequently  stressing  the  distributed 
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lock  implementation  by  forcing  each  scope  acquisition  is  to  grab  at  least  one  remote  lock.  We  also  vary  the 
maximum  length  of  the  pipeline.  Results  arc  plotted  in  Figure  5.18b.  We  demonstrate  that  increasing  the 
length  of  the  pipeline  increases  performance  significantly  and  is  able  to  compensate  for  poor  partitioning, 
rapidly  bringing  down  the  runtime  of  the  problem.  Just  as  in  Section  5.7.2,  we  observe  diminishing  returns 
with  increasing  pipeline  length. 

We  conclude  that  for  the  video  co-segmentation  task,  Distributed  GraphLab  provides  excellent  performance 
while  being  the  only  distributed  graph  abstraction  that  allows  the  use  of  dynamic  prioritized  schedul¬ 
ing.  In  addition,  the  pipelining  system  is  an  effective  way  to  hide  latency,  and  to  some  extent,  a  poor 
partitioning. 


5.8.3  Named  Entity  Recognition  (NER) 


Here  we  return  to  the  NER  task  described  in  the  multicore  experiments  in  Section  5.6.3  using  the  same  data 
and  vertex  program.  Since  the  graph  is  two  colorable  and  relatively  dense  the  chromatic  engine  was  used 
with  random  partitioning.  The  lightweight  floating  point  arithmetic  in  the  NER  computation  in  conjunction 
with  the  relatively  dense  graph  structure  and  random  partitioning  is  essentially  the  worst-case  for  the  current 
Distributed  GraphLab  design,  and  thus  allow  us  to  evaluate  the  overhead  of  the  Distributed  GraphLab 
runtime. 

From  Figure  5.16a  we  see  that  NER  achieved  only  a  modest  3x  improvement  using  16x  more  machines.  We 
attribute  the  poor  scaling  performance  of  NER  to  the  large  vertex  data  size  (816  bytes),  dense  connectivity, 
and  poor  partitioning  (random  cut)  that  resulted  in  substantial  communication  overhead  per  iteration. 
Figure  5.16b  shows  for  each  application,  the  average  number  of  bytes  per  second  transmitted  by  each 
machine  with  varying  size  deployments.  On  four  nodes,  NER  transmits  over  10  times  more  data  than 
Netflix  and  almost  100  times  as  much  data  as  CoSeg.  Beyond  16  machines,  NER  saturates  with  each 
machine  sending  at  a  rate  of  over  100MB  per  second. 

We  evaluated  our  Distributed  GraphLab  implementation  against  a  Hadoop  and  an  MPI  implementation  in 
Figure  5.18c.  In  addition  to  the  optimizations  listed  in  Section  5.8.1,  our  Hadoop  implementation  required 
the  use  of  binary  serialization  methods  to  obtain  reasonable  performance  (decreasing  runtime  by  5x  from 
baseline).  We  demonstrate  that  GraphLab  implementation  of  NER  was  able  to  obtains  a  20-30x  speedup 
over  Hadoop.  The  reason  for  the  performance  gap  is  the  same  as  that  for  the  Netflix  evaluation.  Since  each 
vertex  emits  a  copy  of  itself  for  each  edge:  in  the  extremely  large  CoEM  graph,  this  corresponds  to  over 
100  GB  of  HDFS  writes  occurring  between  the  Map  and  Reduce  stage. 

On  the  other  hand,  our  MPI  implementation  was  able  to  outperform  Distributed  GraphLab.  The  CoEM 
task  requires  extremely  little  computation  in  comparison  to  the  amount  of  data  it  touches.  We  were 
able  to  evaluate  that  the  NER  vertex  program  requires  5.7x  fewer  cycles  per  byte  of  data  accessed  as 
compared  to  the  Netflix  problem  at  d  =  5  (the  hardest  Netflix  case  evaluated).  The  extremely  poor 
computation  to  communication  ratio  stresses  our  communication  implementation,  that  is  outperformed 
by  MPI's  efficient  communication  layer.  Furthermore,  Figure  5.16b  provides  further  evidence  that  we 
fail  to  fully  saturate  the  network  (that  offers  lOGbps).  Further  optimizations  to  eliminate  inefficiencies  in 
GraphLab ’s  communication  layer  should  bring  us  up  to  parity  with  the  MPI  implementation. 
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Exp. 

#Verts 

#Edges 

Vert.  Data 

Edge  Data 

Complexity 

Shape 

Partition 

Engine 

Netflix 

0.5M 

99M 

8d  +  13 

16 

0(d3  +  deg.) 

bipartite 

random 

Chromatic 

CoSeg 

10.5M 

31M 

392 

80 

0(deg.) 

3D  grid 

frames 

Locking 

NER 

2M 

200M 

816 

4 

0(deg.) 

bipartite 

random 

Chromatic 

Table  5.3:  Experiment  input  sizes.  The  vertex  and  edge  data  are  measured  in  bytes  and  the  d  in  Netflix  is 
the  size  of  the  latent  dimension. 


(a)  Coseg  Video  Frame  (b)  Segmentation 


Figure  5.17:  Coseg:  (a)  A  frame  from  the  original  video  sequence  and  the  result  of  running  the  co¬ 
segmentation  algorithm,  (b)  The  predicted  segmentation. 

5.8.4  EC2  Cost  evaluation 

To  illustrate  the  monetary  cost  of  using  the  alternative  abstractions,  we  plot  the  price -runtime  curve  for 
the  Netflix  application  in  Figure  5.19b  in  log-log  scale.  All  costs  are  computed  using  fine-grained  billing 
rather  than  the  hourly  billing  used  by  Amazon  EC2.  The  price-runtime  curve  demonstrates  diminishing 
returns:  the  cost  of  attaining  reduced  runtimes  increases  faster  than  linearly.  As  a  comparison,  we  provide 
the  price-runtime  curve  for  Hadoop  on  the  same  application.  For  the  Netflix  application,  GraphLab  is  about 
two  orders  of  magnitude  more  cost-effective  than  Hadoop. 


5.9  Additional  Related  Work 


Section  5.3  provides  a  detailed  comparison  of  several  contemporary  high-level  parallel  and  distributed 
frameworks.  In  this  section  we  review  related  work  in  classic  parallel  abstractions,  graph  databases,  and 
domain  specific  languages. 

There  has  been  substantial  work  [Angles  and  Gutierrez,  2008]  in  graph  structured  databases  dating  back  to 
the  1980’s  along  with  several  recent  open-source  and  commercial  products  (e.g.,  Neo4j  [201 1]).  Graph 
databases  typically  focus  on  efficient  storage  and  retrieval  of  graph  structured  data  with  support  for  basic 
graph  computation.  In  contrast,  GraphLab  focuses  GrAD  style  computation  on  large  graphs. 

There  are  several  notable  projects  focused  on  using  MapReduce  for  graph  computation.  Pegasus  [Kang 
et  al.,  2009]  is  a  collection  of  algorithms  for  mining  large  graphs  using  Hadoop.  Surfer  [Chen  et  al., 
2010]  extends  MapReduce  with  a  propagation  primitive,  but  does  not  support  asynchronous  or  dynamic 
scheduling.  Alternatively,  Lattanzi  et  al.  [201 1]  proposed  subsampling  large-graphs  (using  MapReduce)  to  a 
size  which  can  be  processed  on  a  single  machine.  While  Lattanzi  et  al.  [2011]  was  able  to  derive  reductions 
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for  some  graph  problems  (e.g.,  minimum  spanning  tree),  the  techniques  arc  not  easily  generalizable  and 
may  not  be  applicable  to  many  MLDM  algorithms. 


5.10  Conclusion 

The  GrAD  methodology  emphasizes  the  importance  of  explicitly  modeling  sparse  computational  dependen¬ 
cies,  asynchronous  computation,  and  dynamic  scheduling  in  large  scale  MLDM  problems.  In  this  chapter 
we  described  how  existing  high-level  parallel  and  distributed  abstractions  fail  to  support  all  three  critical 
properties  of  the  GrAD  methodology.  To  address  these  properties  we  introduced  the  GraphLab  abstraction, 
a  graph-parallel  framework  that  explicitly  targets  GrAD  algorithms  and  generalizes  our  earlier  work  on 
parallel  belief  propagation  and  Gibbs  sampling. 

The  GraphLab  abstraction  introduces  the  data  graph  to  decompose  the  program  state  along  the  sparse 
graphical  dependency  structuring.  User  defined  computation  in  GraphLab  is  expressed  in  the  form  of  vertex 
programs  that  run  asynchronously  on  each  vertex  and  can  directly  read  and  modify  values  on  neighboring 
edges  and  vertices.  GraphLab  supports  a  rich  set  of  schedulers  that  allow  vertex  programs  to  dynamically 
trigger  and  prioritize  computation  on  neighboring  vertices.  To  simplify  reasoning  about  highly  parallel  and 
distributed  asynchronous  executions,  GraphLab  introduces  a  set  of  simple  mutual  exclusion  policies  which 
ensure  a  serializable  execution.  As  a  consequence,  we  can  often  directly  apply  serial  analysis  techniques  to 
programs  implemented  within  the  GraphLab  abstraction.  Finally,  to  manage  sharing  and  aggregation  of 
global  state,  GraphLab  provides  a  powerful  aggregation  framework. 

We  developed  optimized  shared-memory  and  distributed  implementations  of  the  GraphLab  abstraction  by 
extending  the  techniques  used  to  build  scalable  systems  for  belief  propagation  and  Gibbs  sampling.  In  the 
distributed  setting  we  introduced  a  novel  pipelined  locking  protocol  and  multistage  partitioning  and  were 
able  to  achieve  faul-tolerance  by  directly  exploiting  the  key  serializability  properties  of  the  abstraction. 
We  evaluated  our  implementation  on  multi-core  and  distributed  systems  and  demonstrated  strong  scaling 
and  state-of-the-art  performance  on  real-world  applications.  By  implementing  a  wide  range  of  algorithms 
we  demonstrated  the  expressiveness  of  the  GraphLab  abstraction  within  the  context  of  a  single  MLDM 
application.  In  the  distributed  setting  we  demonstrated  that  GraphLab  significantly  outperforms  Hadoop  by 
20-60x,  and  is  competitive  with  application  specific  MPI  implementations. 

The  GraphLab  API  has  the  opportunity  to  be  an  interface  between  the  ML  and  systems  communities. 
Parallel  ML  algorithms  built  around  the  GraphLab  API  automatically  benefit  from  developments  in  parallel 
data  structures.  As  new  locking  protocols  and  parallel  scheduling  primitives  are  incorporated  into  the 
GraphLab  API,  they  become  immediately  available  to  the  ML  community.  Systems  experts  can  more  easily 
port  ML  algorithms  to  new  parallel  hardware  by  porting  the  GraphLab  API.  Future  implementations  of 
GraphLab  will  benefit  from  research  in  system  architectures  and  parallel  data  structures.  These  gains  arc 
automatically  conferred  to  all  algorithms  built  in  the  GraphLab  framework. 
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(a)  CoSeg  Weak  Scaling 


(b)  Pipelined  Locking 


(c)  NER  Comparisons 


(d)  Snapshot  Overhead 


Figure  5.18:  (a)  Runtime  of  the  CoSeg  experiment  as  data  set  size  is  scaled  proportionately  with  the  number 
of  machines.  Ideally,  runtime  is  constant.  GraphLab  experiences  an  11%  increase  in  runtime  scaling  from 
16  to  64  machines,  (b)  The  performance  effects  of  varying  the  length  of  the  pipeline.  Increasing  the  pipeline 
length  has  a  small  effect  on  performance  when  partitioning  is  good.  When  partitioning  is  poor,  increasing 
the  pipeline  length  improves  performance  to  be  comparable  to  that  of  optimal  partitioning.  Runtime  for 
worst-case  partitioning  at  pipeline  length  0  is  omitted  due  to  excessive  runtimes,  (c)  Runtime  of  the  NER 
experiment  with  Distributed  GraphLab,  Fladoop  and  MPI  implementations.  Note  the  logarithmic  scale. 
GraphLab  outperforms  Hadoop  by  about  80x  when  the  number  of  machines  is  small,  and  about  30x  when 
the  number  of  machines  is  large.  The  performance  of  Distributed  GraphLab  is  comparable  to  the  MPI 
implementation,  (d)  Lor  each  application,  the  overhead  of  performing  a  complete  snapshot  of  the  graph 
every  \V\  updates  (where  V  is  the  number  of  vertices  in  the  graph),  when  running  on  a  64  machine  cluster. 
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(a)  Dynamic  Netflix  (b)  EC2  Price/Performance 


Figure  5.19:  (a)  Convergence  rate  when  dynamic  computation  is  used.  Dynamic  computation  can  converge 
to  equivalent  test  error  in  about  half  the  number  of  updates,  (b)  Price  Performance  ratio  of  GraphLab 
and  Hadoop  on  running  on  Amazon  EC2  HPC  nodes  in  a  log-log  scale.  Costs  were  computing  assuming 
line-grained  billing. 
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Chapter  6 


PowerGraph:  Scaling  GraphLab  to  the 
Web  and  Beyond 


In  the  previous  chapters  we  discussed  how  the  GrAD  methodology  can  lead  to  efficient  large-scale 
probabilistic  reasoning  on  graph-structured  data.  We  then  introduced  the  GraphLab  abstraction  which 
generalizes  the  GrAD  methodology,  simplifying  the  design  and  implementation  of  parallel  and  distributed 
graph-parallel  computation.  However,  in  the  distributed  setting  GraphLab  began  to  struggle  on  web-scale 
graphs  and  even  on  smaller  graphs  with  light-weight  computation  (e.g.,  CoEM  Section  5.8.3)  we  found 
that  traditional  graph  partitioning  techniques  failed  and  communication  overhead  dominated.  Upon  deeper 
analysis  we  discovered  that  the  root  cause  of  these  limitations  arrises  from  the  power-law  sparsity  structure 
common  to  a  wide  range  or  real-world  graphs.  Furthermore,  these  limitations  are  not  unique  to  GraphLab 
but  instead  are  present  in  a  wide  range  of  existing  parallel  abstractions. 

In  this  chapter,  we  characterize  the  challenges  of  computation  on  natural  graphs  in  the  context  of  existing 
graph-parallel  abstractions.  We  then  introduce  the  PowerGraph  abstraction  which  exploits  the  internal 
structure  of  graph  programs  to  address  these  challenges.  Leveraging  the  PowerGraph  abstraction  we 
introduce  a  new  approach  to  distributed  graph  placement  and  representation  that  exploits  the  structure  of 
power-law  graphs.  We  provide  a  detailed  analysis  and  experimental  evaluation  comparing  PowerGraph 
to  two  popular  graph-parallel  systems.  Finally,  we  describe  three  different  implementation  strategies  for 
PowerGraph  and  discuss  their  relative  merits  with  empirical  evaluations  on  large-scale  real-world  problems 
demonstrating  order  of  magnitude  gains. 


6.1  Introduction 


The  increasing  need  to  reason  about  large-scale  graph-structured  data  in  machine  learning  and  data  mining 
(MLDM)  presents  a  critical  challenge.  As  the  sizes  of  datasets  grow,  statistical  theory  suggests  that  we 
should  apply  richer  models  to  eliminate  the  unwanted  bias  of  simpler  models,  and  extract  stronger  signals 
from  data.  At  the  same  time,  the  computational  and  storage  complexity  of  richer  models  coupled  with 
rapidly  growing  datasets  have  exhausted  the  limits  of  single  machine  computation. 

The  resulting  demand  has  driven  the  development  of  new  graph-parallel  abstractions  such  as  Pregel 
[Malewicz  et  al.,  2010]  and  GraphLab  that  encode  computation  as  vertex-programs  which  run  in  parallel 
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and  interact  along  edges  in  the  graph.  Graph-parallel  abstractions  rely  on  each  vertex  having  a  small 
neighborhood  to  maximize  parallelism  and  effective  partitioning  to  minimize  communication.  However, 
graphs  derived  from  real-world  phenomena,  like  social  networks  and  the  web,  typically  have  power-law 
degree  distributions,  which  implies  that  a  small  subset  of  the  vertices  connects  to  a  large  fraction  of  the 
graph.  Furthermore,  power-law  graphs  arc  difficult  to  partition  [Abou-Rjeili  and  Karypis,  2006,  Leskovec 
et  al.,  2008]  and  represent  in  a  distributed  environment. 

To  address  the  challenges  of  power-law  graph  computation,  we  introduce  the  PowerGraph  abstraction  which 
exploits  the  structure  of  vertex -programs  and  explicitly  factors  computation  over  edges  instead  of  vertices. 
As  a  consequence,  PowerGraph  exposes  substantially  greater  parallelism,  reduces  network  communication 
and  storage  costs,  and  provides  a  new  highly  effective  approach  to  distributed  graph  placement.  We  describe 
the  design  of  our  distributed  implementation  of  PowerGraph  and  evaluate  it  on  a  large  EC2  deployment 
using  real-world  applications.  In  particular  our  key  contributions  are: 

1 .  An  analysis  of  the  challenges  of  power-law  graphs  in  distributed  graph  computation  and  the  limita¬ 
tions  of  existing  graph  parallel  abstractions  (Sec.  6.2  and  6.3). 

2.  The  PowerGraph  abstraction  (Section  6.4)  which  factors  individual  vertex-programs. 

3.  A  delta  caching  procedure  which  allows  computation  state  to  be  dynamically  maintained  (Sec¬ 
tion  6.4.2). 

4.  A  new  fast  approach  to  data  layout  for  power-law  graphs  in  distributed  environments  (Section  6.5). 

5.  An  theoretical  characterization  of  network  and  storage  (Theorem  6.5.2,  Theorem  6.5.3). 

6.  A  high-performance  open-source  implementation  of  the  PowerGraph  abstraction  (Section  6.7). 

7.  A  comprehensive  evaluation  of  three  implementations  of  PowerGraph  on  a  large  EC2  deployment 
using  real-world  MLDM  applications  (Section  6.6  and  6.7). 

6.2  Graph-Parallel  Abstractions 


A  graph-parallel  abstraction  consists  of  a  sparse  graph  G  =  { V.  E }  and  a  vertex-program  Q  which 
is  executed  in  parallel  on  each  vertex  v  €  V  and  can  interact  (e.g.,  through  shared-state  in  GraphLab, 
or  messages  in  Pregel)  with  neighboring  instances  Q(u)  where  (it,  v)  e  E.  In  contrast  to  more  general 
message  passing  models,  graph-parallel  abstractions  constrain  the  interaction  of  vertex -program  to  a  graph 
structure  enabling  the  optimization  of  data-layout  and  communication.  We  focus  our  discussion  on  Pregel 
and  GraphLab  as  they  are  representative  of  existing  graph-parallel  abstractions. 


6.2.1  Pregel 

Pregel,  introduced  by  Malewicz  et  al.  [2010],  is  a  bulk  synchronous  message  passing  abstraction  in  which 
all  vertex-programs  run  simultaneously  in  a  sequence  of  super-steps.  Within  a  super-step  each  program 
instance  Q(v)  receives  all  messages  from  the  previous  super-step  and  sends  messages  to  its  neighbors  in 
the  next  super-step.  A  barrier  is  imposed  between  super-steps  to  ensure  that  all  program  instances  finish 
processing  messages  from  the  previous  super-step  before  proceeding  to  the  next.  The  program  terminates 
when  there  are  no  messages  remaining  and  every  program  has  voted  to  halt.  Pregel  introduces  commutative 
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associative  message  combiners  which  are  user  defined  functions  that  merge  messages  destined  to  the 
same  vertex.  The  following  is  an  example  of  the  PageRank  vertex-program  implemented  in  Pregel.  The 
vertex-program  receives  the  single  incoming  message  (after  the  combiner)  which  contains  the  sum  of  the 
PageRanks  of  all  in-neighbors.  The  new  PageRank  is  then  computed  and  sent  to  its  out-neighbors. 

Message  combiner (Message  ml.  Message  m2)  : 

return  Message (ml . value ( )  +  m2. value ()); 
void  PregelPageRank (Message  msg)  : 
float  total  =  msg. value (); 
vertex. val  =  0.15  +  0.85*total; 
foreach(nbr  in  out_neighbors )  : 

SendMsg(nbr,  vertex . val/num_out_nbrs ) ; 


6.2.2  GraphLab 

GraphLab  (Chapter  5)  is  an  asynchronous  distributed  shared-memory  abstraction  in  which  vertex -programs 
have  shared  access  to  a  distributed  graph  with  data  stored  on  every  vertex  and  edge.  Each  vertex-program 
may  directly  access  information  on  the  current  vertex,  adjacent  edges,  and  adjacent  vertices  irrespective  of 
edge  direction.  Vertex-programs  can  schedule  neighboring  vertex -programs  to  be  executed  in  the  future. 
GraphLab  ensures  serializability  by  preventing  neighboring  program  instances  from  running  simultaneously. 
The  following  is  an  example  of  the  PageRank  vertex-program  implemented  in  GraphLab.  The  GraphLab 
vertex-program  directly  reads  neighboring  vertex  values  to  compute  the  sum. 

void  GraphLabPageRank ( Scope  scope)  : 
float  accum  =  0; 

foreach  (nbr  in  scope . in_nbrs )  : 

accum  +=  nbr. val  /  nbr . nout_nbrs ( ) ; 
vertex. val  =  0.15  +  0.85  *  accum; 


By  eliminating  messages,  GraphLab  isolates  the  user  defined  algorithm  from  the  movement  of  data,  allowing 
the  system  to  choose  when  and  how  to  move  program  state.  By  allowing  mutable  data  to  be  associated  with 
both  vertices  and  edges  GraphLab  allows  the  algorithm  designer  to  more  precisely  distinguish  between 
data  shared  with  all  neighbors  (vertex  data)  and  data  shared  with  a  particular  neighbor  (edge  data). 


6.2.3  GAS  Decomposition 

While  the  implementation  of  MLDM  vertex-programs  in  GraphLab  and  Pregel  differ  in  how  they  collect 
and  disseminate  information,  they  share  a  common  overall  structure.  To  characterize  this  common  structure 
and  differentiate  between  vertex  and  edge  specific  computation  we  introduce  the  GAS  model  of  graph 
computation. 

The  GAS  model  represents  three  conceptual  phases  of  a  vertex-program:  Gather,  Apply,  and  Scatter.  In 
the  gather  phase,  information  about  adjacent  vertices  and  edges  is  collected  through  a  generalized  sum 
over  the  neighborhood  of  the  vertex  u  on  which  Q(u)  is  run: 

g  (Du,  Du^.v,  Dv) .  (6.1) 

V&Mu 
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Figure  6.1:  The  in  and  out  degree  distributions  of  the  Twitter  follower  network  plotted  in  log-log  scale. 


where  Du,  Dv,  and  Du^.v  are  the  values  (program  state  and  meta-data)  for  vertices  u  and  v  and  edge  (?/,  v). 
The  user  defined  sum  ®  operation  must  be  commutative  and  associative  and  can  range  from  a  numerical 
sum  to  the  union  of  the  data  on  all  neighboring  vertices  and  edges. 

The  resulting  value  T,  is  used  in  the  apply  phase  to  update  the  value  of  the  central  vertex: 

D™  <-  a  (Du,  £) .  (6.2) 

Finally  the  scatter  phase  uses  the  new  value  of  the  central  vertex  to  update  the  data  on  adjacent  edges: 

W  €  Mu  :  {Du^v)  <-  s  (L>”ew,  Du^v,  Dv) .  (6.3) 

The  fan-in  and  fan-out  of  a  vertex-program  is  determined  by  the  corresponding  gather  and  scatter  phases.  For 
instance,  in  PageRank,  the  gather  phase  only  operates  on  in-edges  and  the  scatter  phase  only  operates  on  out- 
edges.  However,  for  many  MLDM  algorithms  the  graph  edges  encode  ostensibly  symmetric  relationships, 
like  friendship,  in  which  both  the  gather  and  scatter  phases  touch  all  edges.  In  this  case  the  fan-in  and 
fan-out  are  equal.  As  we  will  show  in  Section  6.3,  the  ability  for  graph  parallel  abstractions  to  support  both 
high  fan-in  and  fan-out  computation  is  critical  for  efficient  computation  on  natural  graphs. 

GraphLab  and  Pregel  express  GAS  programs  in  very  different  ways.  In  the  Pregel  abstraction  the  gather 
phase  is  implemented  using  message  combiners  and  the  apply  and  scatter  phases  are  expressed  in  the  vertex 
program.  Conversely,  GraphLab  exposes  the  entire  neighborhood  to  the  vertex-program  and  allows  the  user 
to  define  the  gather  and  apply  phases  within  their  program.  The  GraphLab  abstraction  implicitly  defines 
the  communication  aspects  of  the  gather/scatter  phases  by  ensuring  that  changes  made  to  the  vertex  or 
edge  data  are  automatically  visible  to  adjacent  vertices.  It  is  also  important  to  note  that  GraphLab  does  not 
differentiate  between  edge  directions. 


6.3  Challenges  of  Natural  Graphs 


The  sparsity  structure  of  natural  graphs  presents  a  unique  challenge  to  efficient  distributed  graph-parallel 
computation.  One  of  the  hallmark  properties  of  natural  graphs  is  their  skewed  power-law  degree  distribution 
[Faloutsos  et  ah,  1999]:  most  vertices  have  relatively  few  neighbors  while  a  few  have  many  neighbors  (e.g., 
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celebrities  in  a  social  network).  Under  a  power-law  degree  distribution  the  probability  that  a  vertex  has 
degree  d  is  given  by: 

P  (d)  oc  d~a,  (6.4) 

where  the  exponent  a  is  a  positive  constant  that  controls  the  “skewness”  of  the  degree  distribution.  Higher  a 
implies  that  the  graph  has  lower  density  (ratio  of  edges  to  vertices),  and  that  the  vast  majority  of  vertices  are 
low  degree.  As  a  decreases,  the  graph  density  and  number  of  high  degree  vertices  increases.  Most  natural 
graphs  typically  have  a  power-law  constant  around  o:  ~  2.  For  example,  Faloutsos  et  al.  [1999]  estimated 
that  the  inter-domain  graph  of  the  Internet  has  a  power-law  constant  a  &  2.2.  One  can  visualize  the  skewed 
power-law  degree  distribution  by  plotting  the  number  of  vertices  with  a  given  degree  in  log-log  scale.  In 
Figure  6.1,  we  plot  the  in  and  out  degree  distributions  of  the  Twitter  follower  network  demonstrating  the 
characteristic  linear  power-law  form. 

While  power-law  degree  distributions  are  empirically  observable,  they  do  not  fully  characterize  the 
properties  of  natural  graphs.  While  there  has  been  substantial  work  (see  Leskovec  et  al.  [2007])  on 
more  sophisticated  natural  graph  models,  the  techniques  in  this  chapter  focus  only  on  the  degree  distribution 
and  do  not  require  any  other  modeling  assumptions. 

The  skewed  degree  distribution  implies  that  a  small  fraction  of  the  vertices  are  adjacent  to  a  large  fraction 
of  the  edges.  For  example,  one  percent  of  the  vertices  in  the  Twitter  web-graph  arc  adjacent  to  nearly  half 
of  the  edges.  This  concentration  of  edges  results  in  a  star-like  motif  which  presents  challenges  for  existing 
graph-parallel  abstractions: 

Work  Balance:  The  power-law  degree  distribution  can  lead  to  substantial  work  imbalance  in  graph 
parallel  abstractions  that  treat  vertices  symmetrically.  Since  the  storage,  communication,  and  computation 
complexity  of  the  Gather  and  Scatter  phases  is  linear  in  the  degree,  the  running  time  of  vertex-programs 
can  vary  widely  [Suri  and  Vassilvitskii,  2011]. 

Partitioning:  Natural  graphs  are  difficult  to  partition  [Lang,  2004,  Leskovec  et  al.,  2008].  Both  GraphLab 
and  Pregel  depend  on  graph  partitioning  to  minimize  communication  and  ensure  work  balance.  However,  in 
the  case  of  natural  graphs  both  are  forced  to  resort  to  hash-based  (random)  partitioning  which  has  extremely 
poor  locality  (Section  6.5). 

Communication:  The  skewed  degree  distribution  of  natural-graphs  leads  to  communication  asymmetry 
and  consequently  bottlenecks.  In  addition,  high-degree  vertices  can  force  messaging  abstractions,  such  as 
Pregel,  to  generate  and  send  many  identical  messages. 

Storage:  Since  graph  parallel  abstractions  must  locally  store  the  adjacency  information  for  each  vertex, 
each  vertex  requires  memory  linear  in  its  degree.  Consequently,  high-degree  vertices  can  exceed  the 
memory  capacity  of  a  single  machine. 

Computation:  While  multiple  vertex-programs  may  execute  in  parallel,  existing  graph-parallel  abstrac¬ 
tions  do  not  parallelize  within  individual  vertex-programs,  limiting  their  scalability  on  high-degree  ver¬ 
tices. 


6.4  PowerGraph  Abstraction 


To  address  the  challenges  of  computation  on  power-law  graphs,  we  introduce  PowerGraph,  a  new  graph- 
parallel  abstraction  that  eliminates  the  degree  dependence  of  the  vertex -program  by  directly  exploiting  the 
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interface  GASVertexProgram ( u)  { 

//  Run  on  gather_nbrs (u) 

gather  (Du,  Du^v,  Dv )  — >•  Accum 

sum {Accum  left,  Accum  right)  —¥  Accum 

apply  (Du,  Accum)  -A  D£ew 

//  Run  on  scatter_nbrs (u) 

scatter  ( D^ew ,  ,  Dv )  ->  ( D^v ,  Accum) 

} 


Figure  6.2:  All  PowerGraph  programs  must  implement  the  stateless  gather,  sum,  apply,  and  scatter 
functions. 


GAS  decomposition  to  factor  vertex-programs  over  edges.  By  lifting  the  Gather  and  Scatter  phases  into 
the  abstraction,  PowerGraph  is  able  to  retain  the  natural  “think-like-a-vertex”  philosophy  [Malewicz  et  al., 
2010]  while  distributing  the  computation  of  a  single  vertex-program  over  the  entire  cluster. 

PowerGraph  combines  the  best  features  from  both  Pregel  and  GraphLab.  From  GraphLab,  PowerGraph 
borrows  the  data-graph  and  shared-memory  view  of  computation  eliminating  the  need  for  users  to  architect 
the  movement  of  information.  From  Pregel,  PowerGraph  borrows  the  commutative,  associative  gather 
concept.  PowerGraph  supports  both  the  highly-parallel  bulk-synchronous  Pregel  model  of  computation  as 
well  as  the  computationally  efficient  asynchronous  GraphLab  model  of  computation. 

Like  GraphLab,  the  state  of  a  PowerGraph  program  factors  according  to  a  data-graph  with  user  defined 
vertex  data  Dv  and  edge  data  Du^,v.  The  data  stored  in  the  data-graph  includes  both  meta-data  (e.g.,  urls 
and  edge  weights)  as  well  as  computation  state  (e.g.,  the  PageRank  of  vertices).  In  Section  6.5  we  introduce 
vertex-cuts  which  allow  PowerGraph  to  efficiently  represent  and  store  power-law  graphs  in  a  distributed 
environment.  We  now  describe  the  PowerGraph  abstraction  and  how  it  can  be  used  to  naturally  decompose 
vertex-programs.  Then  in  Section  6.5  through  Section  6.7  we  discuss  how  to  implement  the  PowerGraph 
abstraction  in  a  distributed  environment. 


6.4.1  GAS  Vertex-Programs 

Computation  in  the  PowerGraph  abstraction  is  encoded  as  a  state-less  vertex-program  which  implements 
the  GASVertexProgram  interface  (Figure  6.2)  and  therefore  explicitly  factors  into  the  gather,  sum, 
apply,  and  scatter  functions.  Each  function  is  invoked  in  stages  by  the  PowerGraph  engine  following  the 
semantics  in  Alg.  6.1.  By  factoring  the  vertex-program,  the  PowerGraph  execution  engine  can  distribute  a 
single  vertex-program  over  multiple  machines  and  move  computation  to  the  data. 

During  the  gather  phase  the  gather  and  sum  functions  are  used  as  a  map  and  reduce  to  collect  information 
about  the  neighborhood  of  the  vertex.  The  gather  function  is  invoked  in  parallel  on  the  edges  adjacent 
to  u.  The  particular  set  of  edges  is  determined  by  gathermbrs  which  can  be  none,  in,  out,  or  all. 
The  gather  function  is  passed  the  data  on  the  adjacent  vertex  and  edge  and  returns  a  temporary  accumulator 
(a  user  defined  type).  The  result  is  combined  using  the  commutative  and  associative  sum  operation.  The 
final  result  au  of  the  gather  phase  is  passed  to  the  apply  phase  and  cached  by  PowerGraph. 

After  the  gather  phase  has  completed,  the  apply  function  takes  the  final  accumulator  and  computes  a 
new  vertex  value  Du  which  is  atomically  written  back  to  the  graph.  The  size  of  the  accumulator  a„  and 


130 


Algorithm  6.1:  Vertex-Program  Execution  Semantics 
Input:  Center  vertex  u 
if  cached  accumulator  au  is  empty  then 
foreach  neighbor  v  in  gather  jibrs(u)  do 
j_  au  <r-  sum(a„,  gather]  D„,  Du^v,  Dv)) 

Du  t-  apply (Du,  au) 
foreach  neighbor  v  scatter  Jibrs(u)  do 
(Du—. Aa)  i  scatter(Du,  Du^v,  D v) 
if  av  and  A  a  are  not  Empty  then  av  sum(a„,  Aa) 
else  av  Empty 


complexity  of  the  apply  function  play  a  central  role  in  determining  the  network  and  storage  efficiency  of 
the  PowerGraph  abstraction  and  should  be  sub-linear  and  ideally  constant  in  the  degree. 

During  the  scatter  phase,  the  scatter  function  is  invoked  in  parallel  on  the  edges  adjacent  to  u  pro¬ 
ducing  new  edge  values  Du^v  which  are  written  back  to  the  data-graph.  As  with  the  gather  phase,  the 
scattermbrs  determines  the  particular  set  of  edges  on  which  scatter  is  invoked.  The  scatter  function 
returns  an  optional  value  Aa  which  is  used  to  dynamically  update  the  cached  accumulator  av  for  the 
adjacent  vertex  (see  Section  6.4.2). 

In  Figure  6.3  we  implement  the  PageRank,  greedy  graph  coloring,  and  single  source  shortest  path  algorithms 
using  the  PowerGraph  abstraction.  In  PageRank  the  gather  and  sum  functions  collect  the  total  value  of 
the  adjacent  vertices,  the  apply  function  computes  the  new  PageRank,  and  the  scatter  function  is  used  to 
activate  adjacent  vertex-programs  if  necessary.  In  graph  coloring  the  gather  and  sum  functions  collect 
the  set  of  colors  on  adjacent  vertices,  the  apply  function  computes  a  new  color,  and  the  scatter  function 
activates  adjacent  vertices  if  they  violate  the  coloring  constraint.  Finally  in  single  source  shortest  path 
(SSSP),  the  gather  and  sum  functions  compute  the  shortest  path  through  each  of  the  neighbors,  the  apply 
function  returns  the  new  distance,  and  the  scatter  function  activates  affected  neighbors. 


6.4.2  Delta  Caching 

In  many  cases  a  vertex-program  will  be  triggered  in  response  to  a  change  in  a  few  of  its  neighbors.  The 
gather  operation  is  then  repeatedly  invoked  on  cdl  neighbors,  many  of  which  remain  unchanged,  thereby 
wasting  computation  cycles.  For  many  algorithms  [Ahmed  et  al.,  2012]  it  is  possible  to  dynamically 
maintain  the  result  of  the  gather  phase  au  and  skip  the  gather  on  subsequent  iterations. 

The  PowerGraph  engine  maintains  a  cache  of  the  accumulator  au  from  the  previous  gather  phase  for 
each  vertex.  The  scatter  function  can  optionally  return  an  additional  Aa  which  is  atomically  added  to 
the  cached  accumulator  av  of  the  neighboring  vertex  v  using  the  sum  function.  If  Aa  is  not  returned, 
then  the  neighbor’s  cached  av  is  cleared,  forcing  a  complete  gather  on  the  subsequent  execution  of  the 
vertex-program  on  the  vertex  v.  When  executing  the  vertex -program  on  v  the  PowerGraph  engine  uses  the 
cached  av  if  available,  bypassing  the  gather  phase. 

Intuitively,  Aa  acts  as  an  additive  correction  on-top  of  the  previous  gather  for  that  edge.  More  formally,  if 
the  accumulator  type  forms  an  abelian  group:  has  a  commutative  and  associative  sum  (+)  and  an  inverse 
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PageRank  Single  Source  Shortest  Path  (SSSP) 


//  gather_nbrs :  IN_NBRS 

gather  (Du ,  Du^v ,  Dv)  : 

return  Dv  . rank  /  #outNbrs (v) 
sum(a,  b) :  return  a  +  b 
apply (Du,  acc) : 

rnew  =  0.15  +  0.85  *  acc 

Du  . delta  =  (rnew  -  Du . rank)/ 

#outNbrs (u) 

Du  . rank  =  rnew 
//  scatter_nbrs :  OUT_NBRS 
scatter  ( Du ,  Du^v ,  Dv )  : 

if  (  j Du  ■ delta | >e)  Activate (v) 
return  delta 

//  gather_nbrs :  ALL_NBRS 
gather  (Du,  Du^v,  Dv)  : 

return  Dv  +  Dv^u 
sum(a,  b) :  return  min (a,  b) 
apply (Du,  new_dist) : 

Du  =  new_dist 
//  scatter_nbrs :  ALL_NBRS 
scatter  ( D„ ,  Du^v ,  Dv)  : 

II  If  changed  activate  neighbor 
if  (changed (Du) )  Activate (v) 
if  ( increased (Du ) ) 
return  NULL 

else  return  Du  +  Du^v 

Greedy  Graph  Coloring 

Belief  Propagation  (BP) 

//  gather_nbrs:  ALL_NBRS 

//  gather_nbrs:  IN_NBRS 

gather  (Du,  ,  Dv)  : 

gather  (Du,  Dv)  : 

return  set (Dv) 

swap  old  and  new  messages 

sum(a,  b) :  return  union (a,  b) 

return  old  message 

apply ( Du ,  S)  : 

sum(a,  b) :  return  a  *  b 

Du  =  min  c  where  c  S 

apply (Du,  new_dist) : 

//  scatter_nbrs :  ALL_NBRS 

update  belief 

scatter  ( Du ,  Du^v ,  Dv)  : 

//  scatter__nbrs :  OUT_NBRS 

II  Nbr  changed  since  gather 

scatter  ( D„ ,  ,  Dv )  : 

if  (Du  ==  Dv) 

Recompute  new  out  message 

Activate ( v ) 

if (message  changed  sufficiently) 

II  Invalidate  cached  accum 

Activate (v) 

return  NULL 

return  NULL 

Figure  6.3:  The  PageRank,  graph-coloring,  and  single  source  shortest  path  algorithms  implemented  in 
the  PowerGraph  abstraction.  Both  the  PageRank  and  single  source  shortest  path  algorithms  support  delta 
caching  in  the  gather  phase. 


(— )  operation,  then  we  can  define  (shortening  gather  to  g ): 

Aa  =  g(Du,D™v,  D '“ew)  -  g(Du,  Du^v,  Dv).  (6.5) 

In  the  PageRank  example  (Figure  6.3)  we  take  advantage  of  the  abelian  nature  of  the  PageRank  sum  opera¬ 
tion.  For  graph  coloring  the  set  union  operation  is  not  abelian  and  so  we  invalidate  the  accumulator. 


6.4.3  Initiating  Future  Computation 

The  PowerGraph  engine  maintains  a  set  of  active  vertices  on  which  to  eventually  execute  the  vertex-program. 
The  user  initiates  computation  by  calling  Activate  (v)  or  Activate_all  () .  The  PowerGraph  engine 
then  proceeds  to  execute  the  vertex -program  on  the  active  vertices  until  none  remain.  Once  a  vertex-program 
completes  the  scatter  phase  it  becomes  inactive  until  it  is  reactivated. 

Vertices  can  activate  themselves  and  neighboring  vertices.  Each  function  in  a  vertex -program  can  only 
activate  vertices  visible  in  the  arguments  to  that  function.  For  example  the  scatter  function  invoked  on 
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the  edge  (u,  v )  can  only  activate  the  vertices  u  and  v.  This  restriction  is  essential  to  ensure  that  activation 
events  arc  generated  on  machines  on  which  they  can  be  efficiently  processed. 

The  order  in  which  activated  vertices  are  executed  is  up  to  the  PowerGraph  execution  engine.  The  only 
guarantee  is  that  all  activated  vertices  are  eventually  executed.  This  flexibility  in  scheduling  enables 
PowerGraph  programs  to  be  executed  both  synchronously  and  asynchronously,  leading  to  different  tradeoffs 
in  algorithm  performance,  system  performance,  and  determinism. 


Bulk  Synchronous  Execution 

When  run  synchronously,  the  PowerGraph  engine  executes  the  gather,  apply,  and  scatter  phases  in  order. 
Each  phase,  called  a  minor-step,  is  run  synchronously  on  all  active  vertices  with  a  barrier  at  the  end.  We 
define  a  super-step  as  a  complete  series  of  GAS  minor-steps.  Changes  made  to  the  vertex  data  and  edge 
data  are  committed  at  the  end  of  each  minor-step  and  are  visible  in  the  subsequent  minor-step.  Vertices 
activated  in  each  super-step  arc  executed  in  the  subsequent  super-step. 

The  synchronous  execution  model  ensures  a  deterministic  execution  regardless  of  the  number  of  machines 
and  closely  resembles  Pregel.  However,  the  frequent  barriers  and  inability  to  operate  on  the  most  recent 
data  can  lead  to  an  inefficient  distributed  execution  and  slow  algorithm  convergence.  To  address  these 
limitations  PowerGraph  also  supports  asynchronous  execution. 


Asynchronous  Execution 

When  run  asynchronously,  the  PowerGraph  engine  executes  active  vertices  as  processor  and  network 
resources  become  available.  Changes  made  to  the  vertex  and  edge  data  during  the  apply  and  scatter 
functions  are  immediately  committed  to  the  graph  and  visible  to  subsequent  computation  on  neighboring 
vertices. 

Unfortunately,  the  behavior  of  the  asynchronous  execution  depends  on  the  number  machines  and  availability 
of  network  resources  leading  to  non-determinism  that  can  complicate  algorithm  design  and  debugging. 
Furthermore,  for  some  algorithms,  like  statistical  simulation  (see  Chapter  4),  the  resulting  non-determinism, 
if  not  carefully  controlled,  can  lead  to  instability  or  even  divergence. 

To  address  these  challenges,  GraphLab  automatically  enforces  serializability:  every  parallel  execution  of 
vertex -programs  has  a  corresponding  sequential  execution.  In  Chapter  5  it  was  shown  that  serializability  is 
sufficient  to  support  a  wide  range  of  MLDM  algorithms.  To  achieve  serializability,  GraphLab  prevents 
adjacent  vertex-programs  from  running  concurrently  using  a  fine-grained  locking  protocol  which  requires 
sequentially  grabbing  locks  on  all  neighboring  vertices.  Furthermore,  the  locking  scheme  used  by  GraphLab 
is  unfair  to  high  degree  vertices. 

PowerGraph  retains  the  strong  serializability  guarantees  of  GraphLab  while  addressing  its  limitations. 
We  address  the  problem  of  sequential  locking  by  introducing  a  new  parallel  locking  protocol  (described 
in  Section  6.7.4)  which  is  fair  to  high  degree  vertices.  In  addition,  the  PowerGraph  abstraction  exposes 
substantially  more  fined  grained  (edge-level)  parallelism  allowing  the  entire  cluster  to  support  the  execution 
of  individual  vertex  programs. 
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6.4.4  Comparison  with  GraphLab  /  Pregel 


Surprisingly,  despite  the  strong  constraints  imposed  by  the  PowerGraph  abstraction,  it  is  possible  to  emulate 
both  GraphLab  and  Pregel  vertex-programs  in  PowerGraph.  To  emulate  a  GraphLab  vertex-program,  we 
use  the  gather  and  sum  functions  to  concatenate  all  the  data  on  adjacent  vertices  and  edges  and  then  run  the 
GraphLab  program  within  the  apply  function.  Similarly,  to  express  a  Pregel  vertex-program,  we  use  the 
gather  and  sum  functions  to  combine  the  inbound  messages  (stored  as  edge  data)  and  concatenate  the  list 
of  neighbors  needed  to  compute  the  outbound  messages.  The  Pregel  vertex-program  then  runs  within  the 
apply  function  generating  the  set  of  messages  which  arc  passed  as  vertex  data  to  the  scatter  function  where 
they  arc  written  back  to  the  edges. 

In  order  to  address  the  challenges  of  natural  graphs,  the  PowerGraph  abstraction  requires  the  size  of  the 
accumulator  and  the  complexity  of  the  apply  function  to  be  sub-linear  in  the  degree.  However,  directly 
executing  GraphLab  and  Pregel  vertex-programs  within  the  apply  function  leads  the  size  of  the  accumulator 
and  the  complexity  of  the  apply  function  to  be  linear  in  the  degree  eliminating  many  of  the  benefits  on 
natural  graphs. 


6.5  Distributed  Graph  Placement 


The  PowerGraph  abstraction  relies  on  the  distributed  data-graph  to  store  the  computation  state  and  encode 
the  interaction  between  vertex-programs.  The  placement  of  the  data-graph  structure  and  data  plays  a  central 
role  in  minimizing  communication  and  ensuring  work  balance. 

A  common  approach  to  placing  a  graph  on  a  cluster  of  p  machines  is  to  construct  a  balanced  p- way  edge-cut 
(e.g.,  Figure  6.4a)  in  which  vertices  are  evenly  assigned  to  machines  and  the  number  of  edges  spanning 
machines  is  minimized.  Unfortunately,  the  tools  [Karypis  and  Kumar,  1998,  Pellegrini  and  Roman,  1996] 
for  constructing  balanced  edge-cuts  perform  poorly  [Abou-Rjeili  and  Karypis,  2006,  Lang,  2004,  Leskovec 
et  al.,  2008]  on  power-law  graphs.  When  the  graph  is  difficult  to  partition,  both  GraphLab  and  Pregel  resort 
to  hashed  (random)  vertex  placement.  While  fast  and  easy  to  implement,  hashed  vertex  placement  cuts 
most  of  the  edges: 

Theorem  6.5.1.  If  vertices  are  randomly  assigned  to  p  machines  then  the  expected  fraction  of  edges  cut  is: 
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For  a  power-law  graph  with  exponent  a,  the  expected  number  of  edges  cut  per-vertex  is: 
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where  the  h|V  |  ( a )  =  1  d  a 


is  the  normalizing  constant  of  the  power-law  Zipf  distribution. 


Proof.  An  edge  is  cut  if  both  vertices  are  randomly  assigned  to  different  machines.  The  probability  that 
both  vertices  are  assigned  to  different  machines  is  1  —  1/p.  □ 


Every  cut  edge  contributes  to  storage  and  network  overhead  since  both  machines  maintain  a  copy  of  the 
adjacency  information  and  in  some  cases  Gregor  and  Lumsdaine  [2005],  a  ghost  (local  copy)  of  the  vertex 
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Figure  6.4:  (a)  An  edge-cut  and  (b)  vertex-cut  of  a  graph  into  three  parts.  Shaded  vertices  are  ghosts  and 
mirrors  respectively. 


Figure  6.5:  The  communication  pattern  of  the  PowerGraph  abstraction  when  using  a  vertex-cut.  Gather 
function  runs  locally  on  each  machine  and  then  one  accumulators  is  sent  from  each  mirror  to  the  master. 
The  master  runs  the  apply  function  and  then  sends  the  updated  vertex  data  to  all  mirrors.  Finally  the  scatter 
phase  is  run  in  parallel  on  mirrors. 


and  edge  data.  For  example  in  Figure  6.4a  we  construct  a  three-way  edge-cut  of  a  four  vertex  graph 
resulting  in  live  ghost  vertices  and  all  edge  data  being  replicated.  Any  changes  to  vertex  and  edge  data 
associated  with  a  cut  edge  must  be  synchronized  across  the  network.  For  example,  using  just  two  machines, 
a  random  cut  will  cut  roughly  half  the  edges,  requiring  |Ej  /2  communication. 


6.5.1  Balanced  p- way  Vertex-Cut 

By  factoring  the  vertex  program  along  the  edges  in  the  graph.  The  PowerGraph  abstraction  allows  a  single 
vertex -program  to  span  multiple  machines.  In  Figure  6.5  a  single  high  degree  vertex  program  has  been 
split  across  two  machines  with  the  gather  and  scatter  functions  running  in  parallel  on  each  machine  and 
accumulator  and  vertex  data  being  exchanged  across  the  network. 

Because  the  PowerGraph  abstraction  allows  a  single  vertex-program  to  span  multiple  machines,  we  can 
improve  work  balance  and  reduce  communication  and  storage  overhead  by  evenly  assigning  edges  to 
machines  and  allowing  vertices  to  span  machines.  Each  machine  only  stores  the  edge  information  for  the 
edges  assigned  to  that  machine,  evenly  distributing  the  massive  amounts  of  edge  data.  Since  each  edge  is 
stored  exactly  once,  changes  to  edge  data  do  not  need  to  be  communicated.  However,  changes  to  vertex 
must  be  copied  to  all  the  machines  it  spans,  thus  the  storage  and  network  overhead  depend  on  the  number 
of  machines  spanned  by  each  vertex. 

We  minimize  storage  and  network  overhead  by  limiting  the  number  of  machines  spanned  by  each  vertex. 
A  balanced  p-way  vertex-cut  formalizes  this  objective  by  assigning  each  edge  e  E  E  to  a  machine 
A(e)  £  {!,... .  p\.  Each  vertex  then  spans  the  set  of  machines  A(v)  C  {1, . . .  ,p}  that  contain  its  adjacent 
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edges.  We  define  the  balanced  vertex-cut  objective: 


s.t. 


1  1  vev 

max  \{e  G  E  \  A(e ) 
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(6.9) 


where  the  imbalance  factor  A  >  1  is  a  small  constant.  We  use  the  term  replicas  of  a  vertex  v  to  denote 
the  |A(n)|  copies  of  the  vertex  v:  each  machine  in  A(v)  has  a  replica  of  v.  Because  changes  to  vertex 
data  are  communicated  to  all  replicas,  the  communication  overhead  is  also  given  by  The  objective 

(Eq.  (6.8))  therefore  minimizes  the  average  number  of  replicas  in  the  graph  and  as  a  consequence  the  total 
storage  and  communication  requirements  of  the  PowerGraph  engine. 

For  each  vertex  v  with  multiple  replicas,  one  of  the  replicas  is  randomly  nominated  as  the  master  which 
maintains  the  master  version  of  the  vertex  data.  All  remaining  replicas  of  v  are  then  mirrors  and  maintain  a 
local  cached  read  only  copy  of  the  vertex  data,  (e.g..  Figure  6.4b).  For  instance,  in  Figure  6.4b  we  construct 
a  three-way  vertex-cut  of  a  graph  yielding  only  2  mirrors.  Any  changes  to  the  vertex  data  (e.g.,  the  Apply 
function)  must  be  made  to  the  master  which  is  then  immediately  replicated  to  all  mirrors. 

Vertex-cuts  address  the  major  issues  associated  with  edge-cuts  in  power-law  graphs.  Percolation  theory 
[Albert  et  al.,  2000]  suggests  that  power-law  graphs  have  good  vertex-cuts.  Intuitively,  by  cutting  a  small 
fraction  of  the  very  high  degree  vertices  we  can  quickly  shatter  a  graph.  Furthermore,  because  the  balance 
constraint  (Eq.  (6.9))  ensures  that  edges  are  uniformly  distributed  over  machines,  we  naturally  achieve 
improved  work  balance  even  in  the  presence  of  very  high-degree  vertices. 

The  simplest  method  to  construct  a  vertex  cut  is  to  randomly  assign  edges  to  machines.  Random  (hashed) 
edge  placement  is  fully  data-parallel,  achieves  nearly  perfect  balance  on  large  graphs,  and  can  be  applied 
in  the  streaming  setting.  In  the  following  theorem,  we  relate  the  expected  normalized  replication  factor 
(Eq.  (6.8))  to  the  number  of  machines  and  the  power-law  constant  a. 

Theorem  6.5.2  (Randomized  Vertex  Cuts).  A  random  vertex-cut  on  p  machines  has  an  expected  replication: 
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where  D[n]  denotes  the  degree  of  vertex  v.  For  a  power-law  graph  the  expected  replication  (Figure  6.6a)  is 
determined  entirely  by  the  power-law  constant  a: 
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where 
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is  the  normalizing  constant  of  the  power-law  Zipf  distribution. 


Proof.  By  linearity  of  expectation: 
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Figure  6.6:  (a)  Expected  replication  factor  for  different  power-law  constants,  (b)  The  ratio  of  the  expected 
communication  and  storage  cost  of  random  edge  cuts  to  random  vertex  cuts  as  a  function  of  the  number 
machines.  This  graph  assumes  that  edge  data  and  vertex  data  are  the  same  size. 


The  expected  replication  E  [|A(n)|]  of  a  single  vertex  v  can  be  computed  by  considering  the  process  of 
randomly  assigning  the  D[u]  edges  adjacent  to  v.  Let  the  indicator  X,  denote  the  event  that  vertex  v  has  at 
least  one  of  its  edges  on  machine  i.  The  expectation  E  X,]  is  then: 
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The  expected  replication  factor  for  vertex  v  is  then: 
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Treating  D[n]  as  a  Zipf  random  variable: 
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and  taking  the  expectation  under  P  (d)  =  d  a  j  h|yi  (a): 
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While  lower  a  values  (more  high-degree  vertices)  imply  a  higher  replication  factor  (Figure  6.6a)  the 
effective  gains  of  vertex-cuts  relative  to  edge  cuts  (Figure  6.6b)  actually  increase  with  lower  a.  In 
Figure  6.6b  we  plot  the  ratio  of  the  expected  costs  (comm,  and  storage)  of  random  edge-cuts  (Eq.  (6.7))  to 
the  expected  costs  of  random  vertex-cuts  (Eq.  (6. 11))  demonstrating  order  of  magnitude  gains. 

Finally,  the  vertex  cut  model  is  also  highly  effective  for  regular  graphs  since  in  the  event  that  a  good 
edge-cut  can  be  found  it  can  be  converted  to  a  better  vertex  cut: 
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Theorem  6.5.3.  For  a  given  an  edge-cut  with  g  ghosts,  any  vertex  cut  along  the  same  partition  boundary 
has  strictly  fewer  than  g  mirrors. 

Proof  of  Theorem  6.5.3.  Consider  the  two-way  edge  cut  which  cuts  the  set  of  edges  E'  €  E  and  let  V  be 
the  set  of  vertices  in  E' .  The  total  number  of  ghosts  induced  by  this  edge  partition  is  therefore  \  V'\.  If  we 
then  select  and  delete  arbitrary  vertices  from  V'  along  with  their  adjacent  edges  until  no  edges  remain,  then 
the  set  of  deleted  vertices  corresponds  to  a  vertex-cut  in  the  original  graph.  Since  at  most  V  —  1  vertices 
may  be  deleted,  there  can  be  at  most  V'/|  —  1  mirrors.  □ 

6.5.2  Greedy  Vertex-Cuts 

We  can  improve  upon  the  randomly  constructed  vertex-cut  by  de -randomizing  the  edge -placement  process. 
The  resulting  algorithm  is  a  sequential  greedy  heuristic  which  places  the  next  edge  on  the  machine  that 
minimizes  the  conditional  expected  replication  factor.  To  construct  the  de -randomization  we  consider  the 
task  of  placing  the  i  +  1  edge  after  having  placed  the  previous  i  edges.  Using  the  conditional  expectation 
we  define  the  objective: 

argminE  \A(v)\  \  At,  A(ei+i)  =  k  ,  (6.18) 

k  z — ' 

Lvev  J 

where  Ai  is  the  assignment  for  the  previous  i  edges.  Using  Theorem  6.5.2  to  evaluate  Eq.  (6.18)  we  obtain 
the  following  edge  placement  rules  for  the  edge  (u,  v): 

Case  1:  If  A{u)  and  A(v)  intersect,  then  the  edge  should  be  assigned  to  a  machine  in  the  intersection. 

Case  2:  If  A(u)  and  A(v)  arc  not  empty  and  do  not  intersect,  then  the  edge  should  be  assigned  to  one  of 
the  machines  from  the  vertex  with  the  most  unassigned  edges. 

Case  3:  If  only  one  of  the  two  vertices  has  been  assigned,  then  choose  a  machine  from  the  assigned  vertex. 

Case  4:  If  neither  vertex  has  been  assigned,  then  assign  the  edge  to  the  least  loaded  machine. 

Because  the  greedy-heuristic  is  a  de-randomization  it  is  guaranteed  to  obtain  an  expected  replication  factor 
that  is  no  worse  than  random  placement  and  in  practice  can  be  much  better.  Unlike  the  randomized  algorithm, 
which  is  embarrassingly  parallel  and  easily  distributed,  the  greedy  algorithm  requires  coordination  between 
machines.  We  consider  two  distributed  implementations: 

Coordinated:  maintains  the  values  of  Afv)  in  a  distributed  table.  Then  each  machine  runs  the  greedy 
heuristic  and  periodically  updates  the  distributed  table.  Local  caching  is  used  to  reduce  communica¬ 
tion  at  the  expense  of  accuracy  in  the  estimate  of  Ajfv). 

Oblivious:  runs  the  greedy  heuristic  independently  on  each  machine.  Each  machine  maintains  its  own 
estimate  of  ,4,  with  no  additional  communication. 

In  Figure  6.8a,  we  compare  the  replication  factor  of  both  heuristics  against  random  vertex  cuts  on  the 
Twitter  follower  network.  We  plot  the  replication  factor  as  a  function  of  the  number  of  machines  (EC2 
instances  described  in  Section  6.7)  and  find  that  random  vertex  cuts  match  the  predicted  replication  given  in 
Theorem  6.5.2.  Furthermore,  the  greedy  heuristics  substantially  improve  upon  random  placement  with  an 
order  of  magnitude  reduction  in  the  replication  factor,  and  therefore  communication  and  storage  costs.  For 
a  fixed  number  of  machines  (p  =  32),  we  evaluated  (Figure  6.7a)  the  replication  factor  of  the  two  heuristics 
on  five  real- world  graphs  (Table  6.  la).  In  all  cases  the  greedy  heuristics  out-perform  random  placement, 
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(a)  Actual  Replication  (b)  Effect  of  Partitioning 


Figure  6.7:  (a)  The  actual  replication  factor  on  32  machines,  (b)  The  effect  of  partitioning  on  runtime. 


a 

#  Edges 

1.8 

641,383,778 

1.9 

245,040,680 

2.0 

102,838,432 
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57,134,471 

2.2 

35,001,696 

Graph 
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Twitter  [Kwak  et  ah,  2010] 

41M 

1.4B 

UK  [Bordino  et  al.,  2008] 

132.8M 

5.5B 

Amazon  [Boldi  and  Vigna,  2004,  Boldi  et  ah,  2011] 

0.7M 

5.2M 

LiveJournal  [Leskovec,  2011] 

5.4M 

79M 

Hollywood  [Boldi  and  Vigna,  2004,  Boldi  et  al.,  201 1] 

2.2M 

229M 

(a)  Real  world  graphs  (b)  Synthetic  Graphs 

Table  6.1:  (a)  A  collection  of  Real  world  graphs,  (b)  Randomly  constructed  ten-million  vertex  power-law 
graphs  with  varying  a.  Smaller  a  produces  denser  graphs. 


while  doubling  the  load  time  (Figure  6.8b).  The  Oblivious  heuristic  achieves  compromise  by  obtaining  a 
relatively  low  replication  factor  while  only  slightly  increasing  runtime. 


6.6  Abstraction  Comparison 


In  this  section,  we  experimentally  characterize  the  dependence  on  a  and  the  relationship  between  fan-in  and 
fan-out  by  using  the  Pregel,  GraphLab,  and  PowerGraph  abstractions  to  run  PageRank  on  five  synthetically 
constructed  power-law  graphs.  Each  graph  has  ten-million  vertices  and  an  a  ranging  from  1.8  to  2.2.  The 
graphs  were  constructed  by  randomly  sampling  the  out-degree  of  each  vertex  from  a  Zipf  distribution 
and  then  adding  out-edges  such  that  the  in-degree  of  each  vertex  is  nearly  identical.  We  then  inverted 
each  graph  to  obtain  the  corresponding  power-law  fan-in  graph.  The  density  of  each  power-law  graph  is 
determined  by  a  and  therefore  each  graph  has  a  different  number  of  edges  (see  Table  6.  lb). 

We  used  the  GraphLab  vl  C++  implementation  from  Chapter  5  and  added  instrumentation  to  track  network 
usage.  As  of  the  writing  of  this  chapter,  public  implementations  of  Pregel  (e.g.,  Giraph)  were  unable  to 
handle  even  our  smaller  synthetic  problems  due  to  memory  limitations.  Consequently,  we  used  Piccolo 
[Power  and  Li,  2010]  as  a  proxy  implementation  of  Pregel  since  Piccolo  naturally  expresses  the  Pregel 
abstraction  and  provides  an  efficient  C++  implementation  with  dynamic  load-balancing.  Linally,  we  used 
our  implementation  of  PowerGraph  described  in  Section  6.7. 
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Figure  6.8:  (a,b)  Replication  factor  and  runtime  of  graph  ingress  for  the  Twitter  follower  network  as  a 
function  of  the  number  of  machines  for  random,  oblivious,  and  coordinated  vertex-cuts. 


All  experiments  in  this  section  are  evaluated  on  an  eight  node  Linux  cluster.  Each  node  consists  of  two 
quad-core  Intel  Xeon  E5620  processors  with  32  GB  of  RAM  and  is  connected  via  1-GigE  Ethernet.  All 
systems  were  compiled  with  GCC  4.4.  GraphLab  and  Piccolo  used  random  edge-cuts  while  PowerGraph 
used  random  vertex-cuts..  Results  are  averaged  over  20  iterations. 


6.6.1  Computation  Imbalance 

The  sequential  component  of  the  PageRank  vertex-program  is  proportional  to  out-degree  in  the  Pregel 
abstraction  and  in-degree  in  the  GraphLab  abstraction.  Alternatively,  PowerGraph  eliminates  this  sequen¬ 
tial  dependence  by  distributing  the  computation  of  individual  vertex -programs  over  multiple  machines. 
Therefore  we  expect,  highly-skewed  (low  a)  power-law  graphs  to  increase  work  imbalance  under  the  Pregel 
(fan-in)  and  GraphLab  (fan-out)  abstractions  but  not  under  the  PowerGraph  abstraction,  which  evenly 
distributed  high-degree  vertex-programs.  To  evaluate  this  hypothesis  we  ran  eight  “workers”  per  system 
(64  total  workers)  and  recorded  the  vertex-program  time  on  each  worker. 

In  Figure  6.9a  and  Figure  6.9b  we  plot  the  standard  deviation  of  worker  per-iteration  runtimes,  a  measure 
of  work  imbalance,  for  power-law  fan-in  and  fan-out  graphs  respectively.  Higher  standard  deviation  implies 
greater  imbalance.  While  lower  a  increases  work  imbalance  for  GraphLab  (on  fan-in)  and  Pregel  (on 
fan-out),  the  PowerGraph  abstraction  is  unaffected  in  either  edge  direction. 


6.6.2  Communication  Imbalance 

Because  GraphLab  and  Pregel  use  edge-cuts,  their  communication  volume  is  proportional  to  the  number  of 
ghosts:  the  replicated  vertex  and  edge  data  along  the  partition  boundary.  If  one  message  is  sent  per  edge, 
Pregel’s  combiners  ensure  that  exactly  one  network  message  is  transmitted  for  each  ghost.  Similarly,  at 
the  end  of  each  iteration  GraphLab  synchronizes  each  ghost  and  thus  the  communication  volume  is  also 
proportional  to  the  number  of  ghosts.  PowerGraph  on  the  other  hand  uses  vertex-cuts  and  only  synchronizes 
mirrors  after  each  iteration.  The  communication  volume  of  a  complete  iteration  is  therefore  proportional  to 
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Figure  6.9:  Synthetic  Experiments:  Work  Imbalance  and  Communication,  (a,  b)  Standard  deviation 
of  worker  computation  time  across  8  distributed  workers  for  each  abstraction  on  power-law  fan-in  and 
fan-out  graphs,  (b,  c)  Bytes  communicated  per  iteration  for  each  abstraction  on  power-law  fan-in  and 
fan-out  graphs. 


the  number  of  mirrors  induced  by  the  vertex -cut.  As  a  consequence  we  expect  that  PowerGraph  will  reduce 
communication  volume. 


In  Figure  6.9c  and  Figure  6.9d  we  plot  the  bytes  communicated  per  iteration  for  all  three  systems  under 
power-law  fan-in  and  fan-out  graphs.  Because  Pregel  only  sends  messages  along  out-edges,  Pregel 
communicates  more  on  power-law  fan-out  graphs  than  on  power-law  fan-in  graphs. 


On  the  other  hand,  GraphLab  and  PowerGraph’s  communication  volume  is  invariant  to  power-law  fan-in 
and  fan-out  since  neither  considers  edge  direction  during  data-synchronization.  Flowever,  PowerGraph 
communicates  significantly  less  than  GraphLab  which  is  a  direct  result  of  the  efficacy  of  vertex  cuts. 
Finally,  PowerGraph’s  total  communication  increases  only  marginally  on  the  denser  graphs  and  is  the 
lowest  overall. 
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Figure  6.10:  Synthetic  Experiments  Runtime,  (a,  b)  Per  iteration  runtime  of  each  abstraction  on  synthetic 
power-law  graphs. 

6.6.3  Runtime  Comparison 

PowerGraph  significantly  out-performs  GraphLab  and  Pregel  on  low  a  graphs.  In  Figure  6.10a  and 
Figure  6.10b  we  plot  the  per  iteration  runtime  for  each  abstraction.  In  both  cases  the  overall  runtime  perfor¬ 
mance  closely  matches  the  communication  overhead  (Figure  6.9c  and  Figure  6.9d)  while  the  computation 
imbalance  (Figure  6.9a  and  Figure  6.9b)  appears  to  have  little  effect.  The  limited  effect  of  imbalance  is  due 
to  the  relatively  lightweight  nature  of  the  PageRank  computation  and  we  expect  more  complex  algorithms 
(e.g.,  statistical  inference)  to  be  more  susceptible  to  imbalance.  However,  when  greedy  (coordinated) 
partitioning  is  used  we  see  an  additional  25%  to  50%  improvement  in  runtime. 


6.7  Implementation  and  Evaluation 


In  this  section,  we  describe  and  evaluate  our  implementation  of  the  PowerGraph 1  system.  All  experiments 
are  performed  on  a  64  node  cluster  of  Amazon  EC2  ccl .  4xlarge  Linux  instances.  Each  instance  has 
two  quad  core  Intel  Xeon  X5570  processors  with  23GB  of  RAM,  and  is  connected  via  10  GigE  Ethernet. 
PowerGraph  was  written  in  C++  and  compiled  with  GCC  4.5. 

We  implemented  three  variations  of  the  PowerGraph  abstraction.  To  demonstrate  their  relative  implementa¬ 
tion  complexity,  we  provide  the  line  counts,  excluding  common  support  code: 

Bulk  Synchronous  (Sync):  A  fully  synchronous  implementation  of  PowerGraph  as  described  in  Sec¬ 
tion  6.4.3.  [600  lines] 

Asynchronous  (Async):  An  asynchronous  implementation  of  PowerGraph  which  allows  arbitrary  inter¬ 
leaving  of  vertex-programs  Section  6.4.3.  [900  lines] 

Asynchronous  Serializable  (Async+S):  An  asynchronous  implementation  of  PowerGraph  which  guaran¬ 
tees  serializability  of  all  vertex-programs  (equivalent  to  “edge  consistency”  in  GraphLab).  [1600 
lines] 

'PowerGraph  is  now  part  of  the  GraphLab2  open  source  project  and  can  be  obtained  from  http :  //graphlab .  org. 
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In  all  cases  the  system  is  entirely  symmetric  with  no  single  coordinating  instance  or  scheduler.  Each 
instances  is  given  the  list  of  other  machines  and  start  by  reading  a  unique  subset  of  the  graph  data  files 
from  HDFS.  TCP  connections  are  opened  with  other  machines  as  needed  to  build  the  distributed  graph  and 
run  the  engine. 


6.7.1  Graph  Loading  and  Placement 

The  graph  structure  and  data  are  loaded  from  a  collection  of  text  files  stored  in  a  distributed  file-system 
(HDFS)  by  all  instances  in  parallel.  Each  machine  loads  a  separate  subset  of  files  (determined  by  hashing) 
and  applies  one  of  the  three  distributed  graph  partitioning  algorithms  to  place  the  data  as  it  is  loaded.  As  a 
consequence  partitioning  is  accomplished  in  parallel  and  data  is  immediately  placed  in  its  final  location. 
Unless  specified,  all  experiments  were  performed  using  the  oblivious  algorithm.  Once  computation  is 
complete,  the  final  vertex  and  edge  data  arc  saved  back  to  the  distributed  file-system  in  parallel. 

In  Figure  6.7b,  we  evaluate  the  performance  of  a  collection  of  algorithms  varying  the  partitioning  procedure. 
Our  simple  partitioning  heuristics  are  able  to  improve  performance  significantly  across  all  algorithms, 
decreasing  runtime  and  memory  utilization.  Furthermore,  the  runtime  scales  linearly  with  the  replication 
factor:  halving  the  replication  factor  approximately  halves  runtime. 


6.7.2  Synchronous  Engine  (Sync) 

Our  synchronous  implementation  closely  follows  the  description  in  Section  6.4.3.  Each  machine  runs  a 
single  multi-threaded  instance  to  maximally  utilize  the  multi-core  architecture.  We  rely  on  background 
communication  to  achieve  computation/communication  interleaving.  The  synchronous  engine’s  fully 
deterministic  execution  makes  it  easy  to  reason  about  programmatically  and  minimizes  effort  needed  for 
tuning  and  performance  optimizations. 

In  Figure  6.1  la  and  Figure  6.1  lb  we  plot  the  runtime  and  total  communication  of  one  iteration  of  PageRank 
on  the  Twitter  follower  network  for  each  partitioning  method.  To  provide  a  point  of  comparison  (Table  6.2), 
the  Spark  [Zaharia  et  ah,  2010]  framework  computes  one  iteration  of  PageRank  on  the  same  graph  in  97.4s 
on  a  50  node-100  core  cluster  [Stanton  and  Kliot,  2011].  PowerGraph  is  therefore  between  3-8x  faster  than 
Spark  on  a  comparable  number  of  cores.  On  the  full  cluster  of  512  cores,  we  can  compute  one  iteration  in 
3.6s. 

The  greedy  partitioning  heuristics  improves  both  performance  and  scalability  of  the  engine  at  the  cost  of 
increased  load-time.  The  load  time  for  random,  oblivious,  and  coordinated  placement  were  59,  105,  and 
239  seconds  respectively.  While  greedy  partitioning  heuristics  increased  load-time  by  up  to  a  factor  of  four, 
they  still  improve  overall  runtime  if  more  than  20  iterations  of  PageRank  arc  performed.  In  Figure  6.1  lc 
we  plot  the  runtime  of  each  iteration  of  PageRank  on  the  Twitter  follower  network.  Delta  caching  improves 
performance  by  avoiding  unnecessary  gather  computation,  decreasing  total  runtime  by  45%.  Finally,  in 
Figure  6.1  Id  we  evaluate  weak-scaling:  ability  to  scale  while  keeping  the  problem  size  per  processor 
constant.  We  run  SSSP  (Figure  6.3)  on  synthetic  power-law  graphs  (a  =  2),  with  ten-million  vertices  per 
machine.  Our  implementation  demonstrates  nearly  optimal  weak-scaling  and  requires  only  65s  to  solve  a 
6.4B  edge  graph. 
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(a)  Twitter  PageRank  Runtime 


(b)  Twitter  PageRank  Conims 


Figure  6. 1 1 :  Synchronous  Experiments  (a,b)  Synchronous  PageRank  Scaling  on  Twitter  graph,  (c)  The 
PageRank  per  iteration  runtime  on  the  Twitter  graph  with  and  without  delta  caching,  (d)  Weak  scaling  of 
SSSP  on  synthetic  graphs. 


6.7.3  Asynchronous  Engine  (Async) 

We  implemented  the  asynchronous  PowerGraph  execution  model  (Section  6.4.3)  using  a  simple  state 
machine  for  each  vertex  which  can  be  either:  INACTIVE,  GATHER,  APPLY  or  SCATTER.  Once  activated, 
a  vertex  enters  the  gathering  state  and  is  placed  in  a  local  scheduler  which  assigns  cores  to  active  vertices 
allowing  many  vertex-programs  to  run  simultaneously  thereby  hiding  communication  latency.  While 
arbitrary  interleaving  of  vertex  programs  is  permitted,  we  avoid  data  races  by  ensuring  that  individual 
gather,  apply,  and  scatter  calls  have  exclusive  access  to  their  arguments. 

We  evaluate  the  performance  of  the  Async  engine  by  running  PageRank  on  the  Twitter  follower  network. 
In  Figure  6.12a,  we  plot  throughput  (number  of  vertex-program  operations  per  second)  against  the  number 
of  machines.  Throughput  increases  moderately  with  both  the  number  of  machines  as  well  as  improved 
partitioning.  We  evaluate  the  gains  associated  with  delta  caching  (Section  6.4.2)  by  measuring  throughput 
as  a  function  of  time  (Figure  6.12b)  with  caching  enabled  and  with  caching  disabled.  Caching  allows  the 
algorithm  to  converge  faster  with  fewer  operations.  Surprisingly,  when  caching  is  disabled,  the  throughput 
increases  over  time.  Further  analysis  reveals  that  the  computation  gradually  focuses  on  high-degree  vertices, 
increasing  the  computation/communication  ratio. 
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We  evaluate  the  graph  coloring  vertex-program  (Figure  6.3)  which  cannot  be  run  synchronously  since  all 
vertices  would  change  to  the  same  color  on  every  iteration.  Graph  coloring  is  a  proxy  for  many  MLDM 
algorithms  like  Gibbs  sampling  (Chapter  4).  In  Figure  6.12c  we  evaluate  weak-scaling  on  synthetic 
power-law  graphs  (a  =  2)  with  five-million  vertices  per  machine  and  find  that  the  Async  engine  performs 
nearly  optimally.  The  slight  increase  in  runtime  may  be  attributed  to  an  increase  in  the  number  of  colors 
due  to  increasing  graph  size. 


6.7.4  Async.  Serializable  Engine  (Async+S) 

The  Async  engine  is  useful  for  a  broad  range  of  tasks,  providing  high  throughput  and  performance.  Flowever, 
unlike  the  synchronous  engine,  the  asynchronous  engine  is  difficult  to  reason  about  programmatically.  We 
therefore  extended  the  Async  engine  to  enforce  serializability. 

The  Async+S  engine  ensures  serializability  by  preventing  adjacent  vertex-programs  from  running  simul¬ 
taneously.  Ensuring  serializability  for  graph-parallel  computation  is  equivalent  to  solving  the  dining 
philosophers  problem  where  each  vertex  is  a  philosopher,  and  each  edge  is  a  fork.  GraphLab  implemented 
the  Dijkstra  [1971]  algorithm  where  forks  arc  acquired  sequentially  according  to  a  total  ordering.  Instead, 
PowerGraph  implements  the  Chandy  and  Misra  [1984]  solution  which  acquires  all  forks  simultaneously, 
permitting  a  high  degree  of  parallelism.  We  extend  the  Chandy  and  Misra  [1984]  solution  to  the  vertex-cut 
setting  by  enabling  each  vertex  replica  to  request  only  forks  for  local  edges  and  using  a  simple  consensus 
protocol  to  establish  when  all  replicas  have  succeeded. 

We  evaluate  the  scalability  and  computational  efficiency  of  the  Async+S  engine  on  the  graph  coloring  task. 
We  observe  in  Figure  6.12c  that  the  amount  of  achieved  parallelism  does  not  increase  linearly  with  the 
number  of  vertices.  Because  the  density  (i.e.,  contention)  of  power-law  graphs  increases  super-linearly  with 
the  number  of  vertices,  we  do  not  expect  the  amount  of  serializable  parallelism  to  increase  linearly. 

In  Figure  6.1 2d,  we  plot  the  proportion  of  edges  that  satisfy  the  coloring  condition  (both  vertices  have 
different  colors)  for  both  the  Async  and  the  Async+S  engines.  While  the  Async  engine  quickly  satisfies  the 
coloring  condition  for  most  edges,  the  remaining  1%  take  34%  of  the  runtime.  We  attribute  this  behavior  to 
frequent  races  on  tightly  connected  vertices.  Alternatively,  the  Async+S  engine  performs  more  uniformly. 
If  we  examine  the  total  number  of  user  operations  we  find  that  the  Async  engine  does  more  than  twice  the 
work  of  the  Async+S  engine. 

Finally,  we  evaluate  the  Async  and  the  Async+S  engines  on  a  popular  machine  learning  algorithm: 
Alternating  Least  Squares  (ALS).  The  ALS  algorithm  has  a  number  of  variations  which  allow  it  to  be 
used  in  a  wide  range  of  applications  including  user  personalization  [Zhou  et  al.,  2008]  and  document 
semantic  analysis  [Hofmann,  1999].  We  apply  ALS  to  the  Wikipedia  term-document  graph  consisting  of 
1 1M  vertices  and  315M  edges  to  extract  a  mixture  of  topics  representation  for  each  document  and  term. 
The  number  of  topics  d  is  a  free  parameter  that  determines  the  computational  complexity  O  (d3)  of  each 
vertex-program.  In  Figure  6.13a,  we  plot  the  ALS  throughput  on  the  Async  engine  and  the  Async+S  engine. 
While  the  throughput  of  the  Async  engine  is  greater,  the  gap  between  engines  shrinks  as  d  increases  and 
computation  dominates  the  consistency  overhead.  To  demonstrate  the  importance  of  serializability,  we  plot 
in  Figure  6.13b  the  training  error,  a  measure  of  solution  quality,  for  both  engines.  We  observe  that  while 
the  Async  engine  has  greater  throughput,  the  Async+S  engine  converges  faster. 

The  complexity  of  the  Async+S  engine  is  justified  by  the  necessity  for  serializability  in  many  applications 
(e.g.,  ALS).  Furthermore,  serializability  adds  predictability  to  the  nondeterministic  asynchronous  execution. 
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PageRank 

Runtime 

\E\ 

System 

Kang  et  al.  [2009] 

198s 

- 

1.1B 

50x8 

Zaharia  et  al.  [2010] 

97.4s 

40M 

1.5B 

50x2 

Ekanayake  et  al.  [2010] 

36s 

50M 

1.4B 

64x4 

PowerGraph  (Sync) 

3.6s 

40M 

1.5B 

64x8 

Triangle  Count 

Runtime 

\v\ 

\E\ 

System 

Suri  and  Vassilvitskii  [2011] 

423m 

40M 

1.4B 

1636x? 

PowerGraph  (Sync) 

1.5m 

40M 

1.4B 

64x16 

LDA 

Tok/sec 

Topics 

System 

Smola  and  Narayanamurthy  [2010] 

150M 

1000 

100x8 

PowerGraph  (Async) 

110M 

1000 

64x16 

Table  6.2:  Relative  performance  of  PageRank,  triangle  counting,  and  LDA  on  similar  graphs.  PageRank 
runtime  is  measured  per  iteration.  Both  PageRank  and  triangle  counting  were  run  on  the  Twitter  follower 
network  and  LDA  was  run  on  Wikipedia.  The  systems  are  reported  as  number  of  nodes  by  number  of  cores. 


For  example,  even  graph  coloring  may  not  terminate  on  dense  graphs  unless  serializability  is  ensured. 


6.7.5  Fault  Tolerance 

Like  GraphLab  and  Pregel,  PowerGraph  achieves  fault-tolerance  by  saving  a  snapshot  of  the  data-graph. 
The  synchronous  PowerGraph  engine  constructs  the  snapshot  between  super-steps  and  the  asynchronous 
engine  suspends  execution  to  construct  the  snapshot.  An  asynchronous  snapshot  using  GraphLab’s  snapshot 
algorithm  can  also  be  implemented.  The  checkpoint  overhead,  typically  a  few  seconds  for  the  largest 
graphs  we  considered,  is  small  relative  to  the  running  time  of  each  application. 


6.7.6  MLDM  Applications 

To  assess  the  relative  performance  of  PowerGraph  we  compare  against  published  results  on  several  real- 
world  problems  and  datasets.  In  Table  6.2  we  provide  comparisons  of  the  PowerGraph  system  with 
published  results  on  similar  data  for  PageRank,  Triangle  Counting  [Suri  and  Vassilvitskii,  2011],  and 
collapsed  Gibbs  sampling  for  the  LDA  model  [Smola  and  Narayanamurthy,  2010]. 

We  compared  our  PageRank  per  iteration  runtime  with  published  results  on  other  systems  and  were  able  to 
achieve  order  of  magnitude  gains  compared  to  published  results.  However,  to  provide  a  fair  comparison  we 
did  not  take  advantage  of  dynamic  or  asynchronous  scheduling  which  would  require  accuracy  information 
for  the  published  results.  In  Figure  6.14a  we  compare  the  PageRank  algorithm  when  run  synchronously, 
asynchronously,  and  asynchronously  with  dynamic  scheduling  (following  the  GrAD  methodology).  The 
dynamic  asynchronous  scheduling  substantially  accelerates  convergence  (reduction  in  error).  To  get  a  sense 
of  the  importance  of  dynamic  scheduling  we  also  plot  the  number  of  vertices  against  the  update  count  in 
Figure  6.14b.  More  than  50%  of  the  vertices  are  updated  only  once. 

The  triangle  counting  application  computes  for  each  vertex  the  number  of  neighbors  that  are  also  connected 
(form  a  triangle).  Triangle  counting  is  used  to  estimate  the  strength  of  the  community  around  a  vertex.  Users 
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with  a  greater  triangles  to  neighbors  ratio  have  a  tighter  community.  Triangle  counting  was  implemented  in 
the  PowerGraph  abstraction  by  using  the  gather  phase  to  collect  the  set  of  neighboring  vertices  and  the 
scatter  phase  to  compute  and  store  the  intersection  of  the  neighborhood  set  on  each  edge.  In  contrast,  Suri 
and  Vassilvitskii  [201 1]  used  Hadoop  and  were  forced  to  emit  a  large  number  of  intermediate  key- value 
pairs  (high  fan-out)  flooding  the  network.  As  a  consequence  PowerGraph  is  over  two  orders  of  magnitude 
faster. 

For  LDA,  the  state-of-the-art  solution  is  a  heavily  optimized  system  designed  for  this  specific  task  by 
Smola  and  Narayanamurthy  [2010].  We  implemented  an  asynchronous  version  of  the  collapsed  Gibbs 
sampler  for  LDA  similar  to  the  one  described  by  Smola  and  Narayanamurthy  [2010].  Neither  samplers  are 
ergodic  but  both  generate  reasonable  topics  relatively  quickly.  The  PowerGraph  sampler  enforces  stronger 
local  consistency  guarantees  and  could  potentially  converge  faster  for  strongly  correlated  documents. 
Nonetheless,  we  compared  both  samplers  in  terms  of  the  sampling  throughput.  We  find  that  PowerGraph  is 
able  to  achieve  comparable  performance  using  only  200  lines  of  user  code. 


6.8  Additional  Related  Work 


The  vertex-cut  approach  to  distributed  graph  placement  is  related  to  work  [Catalyurek  and  Aykanat,  1996, 
Devine  et  al.,  2006]  in  hypergraph  partitioning.  In  particular,  a  vertex-cut  problem  can  be  cast  as  a 
hypergraph-cut  problem  by  converting  each  edge  to  a  vertex,  and  each  vertex  to  a  hyper-edge.  However, 
existing  hypergraph  partitioning  can  be  very  time  intensive.  While  our  cut  objective  is  similar  to  the 
“communication  volume”  objective,  the  streaming  vertex  cut  setting  described  in  this  chapter  is  novel. 
Stanton  and  Kliot  [201 1]  developed  several  heuristics  for  the  streaming  edge-cuts  but  do  not  consider  the 
vertex-cut  problem. 

Several  [  Buluc  and  Gilbert,  2011,  Kang  et  al.,  2009]  have  proposed  generalized  sparse  matrix  vector 
multiplication  as  a  basis  for  graph-parallel  computation.  These  abstractions  operate  on  commutative 
associative  semi-rings  and  therefore  also  have  generalized  gather  and  sum  operations.  However,  they  do 
not  support  the  more  general  apply  and  scatter  operations,  as  well  as  mutable  edge-data  and  are  based  on  a 
strictly  synchronous  model  in  which  all  computation  is  run  in  every  iteration. 

While  we  discuss  Pregel  and  GraphLab  in  detail,  there  are  other  similar  graph-parallel  abstractions.  Closely 
related  to  Pregel  is  BPGL  [Gregor  and  Lumsdaine,  2005]  which  implements  a  synchronous  traveler  model. 
Signal/Collect  [Stutz  et  al.,  2010]  is  an  asynchronous  messaging  abstraction  with  dynamic  scheduling. 
However  signal/collect  is  a  shared  memory  abstraction.  Alternatively,  Kineograph  [Cheng  et  al.,  2012] 
presents  a  graph-parallel  framework  for  time-evolving  graphs  which  mixes  features  from  both  GraphLab 
and  Piccolo.  Pujol  et  al.  [2010]  present  a  distributed  graph  database  but  do  not  explicitly  consider  the 
power-law  structure.  Finally,  Kyrola  et  al.  [2012]  present  GraphChi:  an  efficient  single-machine  disk-based 
implementation  of  the  GraphLab  abstraction.  Impressively,  it  is  able  to  significantly  out-perform  large 
Hadoop  deployments  on  many  graph  problems  while  using  only  a  single  machine:  performing  one  iteration 
of  PageRank  on  the  Twitter  Graph  in  only  158s  (PowerGraph:  3.6s).  The  techniques  described  in  GraphChi 
can  be  used  to  add  out-of-core  storage  to  PowerGraph. 
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6.9  Conclusion 


The  need  to  reason  about  large-scale  graph-structured  data  has  driven  the  development  of  new  graph-parallel 
abstractions  such  as  GraphLab  and  Pregel.  However  graphs  derived  from  real-world  phenomena  often 
exhibit  power-law  degree  distributions,  which  are  difficult  to  partition  and  can  lead  to  work  imbalance  and 
substantially  increased  communication  and  storage. 

To  address  these  challenges,  we  introduced  the  PowerGraph  abstraction  which  exploits  the  Gather-Apply- 
Scatter  model  of  computation  to  factor  vertex -programs  over  edges,  splitting  high-degree  vertices  and 
exposing  greater  parallelism  in  natural  graphs.  We  then  introduced  vertex-cuts  and  a  collection  of  fast 
greedy  heuristics  to  substantially  reduce  the  storage  and  communication  costs  of  large  distributed  power-law 
graphs.  We  theoretically  related  the  power-law  constant  to  the  communication  and  storage  requirements  of 
the  PowerGraph  system  and  empirically  evaluate  our  analysis  by  comparing  against  GraphLab  and  Pregel. 
Finally,  we  evaluate  the  PowerGraph  system  on  several  large-scale  problems  using  a  64  node  EC2  cluster 
and  demonstrating  the  scalability  and  efficiency  and  in  many  cases  order  of  magnitude  gains  over  published 
results. 

We  are  actively  using  PowerGraph  to  explore  new  large-scale  machine  learning  algorithms.  We  are 
beginning  to  study  how  vertex  replication  and  data-dependencies  can  be  used  to  support  fault-tolerance 
without  check-pointing.  In  addition,  we  are  exploring  ways  to  support  time-evolving  graph  structures. 
Finally,  we  believe  that  many  of  the  core  ideas  in  the  PowerGraph  abstraction  can  have  a  significant  impact 
in  the  design  and  implementation  of  graph-parallel  systems  beyond  PowerGraph. 
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(a)  Twitter  PageRank  Throughput 


(b)  Twitter  PageRank  Delta  Cache 


Figure  6.12:  Asynchronous  Experiments  (a)  Number  of  user  operations  (gather/apply/scatter)  issued  per 
second  by  Dynamic  PageRank  as  #  machines  is  increased,  (b)  Total  number  of  user  ops  with  and  without 
caching  plotted  against  time,  (c)  Weak  scaling  of  the  graph  coloring  task  using  the  Async  engine  and  the 
Async+S  engine  (d)  Proportion  of  non-conflicting  edges  across  time  on  a  8  machine,  40M  vertex  instance 
of  the  problem.  The  green  line  is  the  rate  of  conflicting  edges  introduced  by  the  lack  of  consistency  (peak 
236K  edges  per  second)  in  the  Async  engine.  When  the  Async+S  engine  is  used  no  conflicting  edges  are 
ever  introduced. 
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Figure  6. 13:  (a)  The  throughput  of  ALS  measured  in  millions  of  User  Operations  per  second,  (b)  Training 
error  (lower  is  better)  as  a  function  of  running  time  for  ALS  application. 


(a)  Page  Rank  Update  Counts  (b)  Page  Rank  Update  Counts 

Figure  6.14:  PageRank  Updates  and  Convergence:  To  assess  the  gains  of  applying  the  GrAD  methodol¬ 
ogy  to  PageRank  we  compare  synchronous,  asynchronous,  and  asynchronous  +  dynamic  implementations 
achieved  by  using  the  synchronous  and  asynchronous  engine.  We  plot  (in  log-scale)  the  L\  error  (a)  relative 
to  the  true  PageRank  (computed  by  running  synchronous  PageRank  to  convergence)  as  a  function  of  vertex 
updates.  To  get  a  sense  of  the  importance  of  dynamic  scheduling  we  plot  (b)  the  number  of  vertices  against 
the  number  of  times  those  vertices  were  updated.  More  than  50%  of  the  vertices  are  only  updated  once. 
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Chapter  7 


Conclusion  and  Observations 


Probabilistic  reasoning  lies  at  the  foundation  of  machine  learning  and  statistics  and  is  essential  to  extracting 
signal  from  noisy  data.  This  thesis  explored  the  task  of  designing  parallel  and  distributed  algorithms  and 
systems  for  probabilistic  reasoning.  We  proposed  the  three  design  principles: 

1.  Graphical:  Explicitly  represent  statistical  and  computational  dependencies  as  a  sparse  graph. 

2.  Asynchronous:  Allow  computation  to  proceed  asynchronously  as  dependencies  are  satisfied  obeying 
the  dependency  structure  encoded  in  the  graph. 

3.  Dynamic:  Adaptively  identify  and  prioritize  sub-problems  that  when  solved  make  the  most  progress 
towards  the  solution. 

The  Graphical,  Asynchronous,  and  Dynamic  design  principles  form  the  basis  of  the  GrAD  design  method¬ 
ology  which  lead  to  the  thesis  statement: 

By  factoring  large-scale  probabilistic  computation  with  respect  to  the  underlying  graphical 
structure,  asynchronously  addressing  sequential  dependencies,  and  dynamically  scheduling 
computation  we  can  construct  efficient  parallel  and  distributed  systems  capable  of  scalable 
probabilistic  reasoning. 

We  evaluated  the  thesis  statement  by  first  developing  a  series  of  efficient  parallel  and  distributed  proba¬ 
bilistic  inference  algorithms.  In  Chapter  3  we  studied  parallel  and  distributed  algorithms  for  loopy  belief 
propagation.  We  demonstrated  both  theoretically  and  experimentally  that  the  natural  fully  synchronous  par¬ 
allelization  can  be  highly  inefficient.  We  then  designed  a  series  of  alternative  parallelizations  by  exploiting 
asynchrony  and  dynamic  scheduling.  Through  this  process  we  developed  analytical  tools  to  characterize  the 
flow  of  information  along  the  graph  and  the  relationship  between  model  parameters  and  the  approximate 
sequential  structure  of  message  passing  inference.  In  the  end  we  arrived  at  the  SplashBP  algorithm  which 
employs  optimal  tree  scheduling  along  with  advances  in  asynchronous  dynamic  scheduling  to  achieve  an 
efficient  scalable  parallel  and  distributed  inference  algorithm.  We  implemented  each  of  these  algorithms 
in  the  multicore  and  distributed  settings  and  demonstrated  how  applying  the  GrAD  design  principles  can 
accelerate  convergence,  improve  stability,  and  lead  to  scalable  systems  for  probabilistic  reasoning. 

Motivated  by  our  work  on  loopy  belief  propagation,  in  Chapter  4  we  applied  the  GrAD  design  principles  to 
the  widely  used  Gibbs  sampling  algorithm  from  Bayesian  statistics.  We  again  demonstrated  that,  the  natural 
synchronous  parallelization  leads  to  a  Gibbs  sampler  which  converges  to  the  wrong  stationary  distribution. 
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We  then  applied  the  GrAD  design  principles  to  develop  three  parallel  Gibbs  samplers  which  are  guaranteed 
to  accelerate  mixing  while  simultaneously  preserving  the  correct  stationary  distribution.  Through  this 
process  we  also  developed  a  set  of  basic  analytical  tools  that  relate  properties  of  the  graph  coloring  to  the 
parallel  structure  of  Gibbs  sampling  as  well  as  the  stationary  distribution  of  the  incorrect  Synchronous 
Gibbs  sampler.  To  address  settings  with  tightly  coupled  variables  we  introduced  the  Splash  Gibbs  sampler 
which  incrementally  constructs  multiple  conditionally  independent  bounded  treewidth  blocks  (Splashes) 
in  parallel.  We  accelerate  burn-in  by  dynamically  growing  the  Splash  tree  structure.  We  implemented 
and  evaluated  the  proposed  samplers  in  the  multicore  setting  an  demonstrated  strong  parallel  scaling  and 
accelerated  burn-in  and  mixing. 

After  validating  the  thesis  approach  by  designing  and  implementing  a  collection  of  efficient  parallel 
and  distributed  inference  algorithms,  we  introduced  GraphLab  (Chapter  5)  which  generalizes  our  earlier 
system  design  efforts  and  specifically  targets  algorithms  that  follow  the  GrAD  design  principles.  The 
GraphLab  abstractions  decomposes  the  program  state  and  computation  along  the  structure  of  the  graph, 
introduces  a  simple  sequential  shared-memory  model  of  graph  computation,  and  supports  a  range  of 
powerful  dynamic  scheduling  primitives.  As  a  consequence  GraphLab,  allows  machine  learning  experts  to 
compactly  represent  sophisticated  asynchronous  dynamic  graph  algorithms  for  probabilistic  reasoning.  The 
GraphLab  system  then  executes  those  algorithms  efficiently  on  multicore  and  distributed  systems  using 
many  of  the  system  design  techniques  developed  in  our  work  on  probabilistic  inference.  We  experimentally 
assessed  the  GraphLab  system  on  a  wide  range  of  real-world  applications  demonstrating  strong  scalability 
and  performance;  further  validating  the  thesis  approach  on  an  even  wider  range  of  machine  learning  and 
data  mining  applications. 

While  GraphLab  generalized  the  thesis  approach  to  a  much  broader  class  of  Machine  Learning  and  Data 
Mining  algorithms,  as  we  scaled  to  increasingly  larger  graphs  with  billions  of  vertices  and  edges  we 
discovered  a  critical  performance  barrier.  In  Chapter  6,  we  showed  that  the  power-law  sparsity  structure 
found  in  the  majority  of  real-world  graphs  leads  to  substantial  communication  and  storage  overhead 
for  existing  graph  computation  systems,  including  GraphLab.  To  address  this  challenge  we  introduced 
PowerGraph  which  refined  the  GraphLab  abstraction  to  further  decompose  vertex  programs  along  edges 
in  the  graph  and  thereby  enable  a  vertex-cut  based  distributed  graph  representation.  As  a  consequence 
we  were  able  to  leverage  the  power-law  structure  to  improve  scalability  and  reduce  communication  and 
storage  overhead.  We  implemented  PowerGraph  in  the  multicore  and  distributed  setting  and  then  ported 
most  of  the  GraphLab  algorithms  to  the  PowerGraph  system.  Finally  we  theoretically  and  experimentally 
demonstrated  that  the  PowerGraph  system  is  capable  of  achieving  orders-of-magnitude  gains  in  scalability 
and  performance  over  existing  systems. 

Throughout  the  thesis  work,  we  observed  several  repeated  patterns  that  go  beyond  the  thesis  statement.  We 
now  discuss  some  of  these  common  patterns  and  their  impact  on  our  design  decisions  and  the  performance 
of  the  resulting  algorithms  and  systems  as  well  as  their  implications  on  future  work. 


7.1  Modeling  Through  Recurrence  Relationships 

Whether  computing  the  marginal  posterior  of  some  variable  or  estimating  the  centrality  of  a  user  in  a  social 
network,  the  algorithms  considered  in  this  thesis  defined  a  recurrence  relationship  between  a  vertex  and  its 
neighbors  and  then  specified  a  method  to  resolve  that  recurrence  relationship.  In  most  cases,  the  algorithms 
iteratively  satisfy  the  relationships  locally  and  then,  if  necessary,  repair  the  relationship  on  neighboring 
vertices. 
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The  graphical  recurrence  relationship  simplifies  the  process  of  designing  complex  models  for  large  structured 
data.  It  is  surprisingly  intuitive  to  express  complex  models  in  terms  of  local  relationships.  For  example,  the 
popularity  of  a  user  in  a  social  network  can  be  succinctly  and  intuitively  expressed  as  a  simple  function  of 
the  popularity  of  her  followers.  By  assuming  we  have  successfully  modeled  our  neighbors  in  the  graph, 
we  can  introduce  a  simple  recurrence  step  to  extend  the  model  to  ourselves.  Moreover,  by  decomposing 
the  modeling  task  to  that  of  modeling  local  relationships  we  simplify  the  task  of  building  large  complex 
models. 

The  process  of  iteratively  maintaining  the  graphical  recurrence  relationship  is  well  suited  for  graph-parallel 
computation.  Because  the  recurrence  relationship  can  be  expressed  and  maintained  by  only  accessing  the 
neighborhood  of  a  vertex,  we  are  able  to  factor  computation  along  vertices  exposing  substantial  parallelism 
and  locality.  As  a  consequence,  we  can  more  effectively  parallelize  these  algorithms  and  represent  the 
program  state  in  a  distributed  environment. 

The  recurrence  relationship  approach  to  modeling  and  algorithm  design  in  naturally  amenable  to  dynamic 
scheduling.  By  defining  a  measure  of  the  extent  to  which  a  recurrence  is  violated,  we  can  associate  priorities 
with  each  vertex.  Then  by  focusing  computational  resources  on  the  most  violated  recurrence  relationships 
and  propagating  corrections  along  wavefronts  we  can  often  accelerate  convergence  of  these  algorithms.  We 
believe  that  future  work  in  the  development  of  and  application  of  prioritized  scheduling  will  produce  in  a 
range  of  more  efficient  parallel  and  distributed  algorithms. 


7.2  Tradeoff  between  Parallelism  and  Convergence 


By  decomposing  computation  along  vertices  in  a  graph,  graph-parallel  algorithms  expose  substantial 
parallelism.  In  principle,  we  could  run  computation  on  every  single  vertex  simultaneously.  This  realization 
forms  the  basis  of  existing  graph-parallel  systems  and  has  contributed  to  the  re-emergence  of  the  bulk 
synchronous  model  of  graphical  computation.  However,  as  we  have  demonstrated  in  this  thesis  work, 
such  an  extreme  approach  to  parallelism  leads  to  inefficient  algorithms  and  in  some  cases  inaccurate 
solutions. 

The  graphical  structure  of  graph-parallel  algorithms  induces  an  inherent  sequential  component  to  their 
execution.  At  a  high-level,  computing  the  property  of  a  vertex  typically  requires  knowing  the  correct  value 
for  its  neighboring  vertices.  This  sequential  dependency  structure  is  most  apparent  in  chain  graphs  and  in 
Chapter  3  we  discussed  the  role  of  chains  of  dependencies  in  the  convergence  of  loopy  belief  propagation. 
Alternatively,  in  Chapter  4  we  discussed  how  neighboring  variables  must  be  updated  sequentially  to  prove 
algorithm  correctness. 

In  Section  3.3  we  demonstrated  that  the  inefficiency  that  arises  from  not  addressing  the  sequential  depen¬ 
dencies  can  result  in  asymptotically  slower  convergence.  However,  by  running  computation  on  multiple 
vertices  we  are  able  to  increase  the  computation  throughput.  We  achieve  a  balance  between  the  rate 
of  convergence  and  the  increased  throughput  due  to  parallelism  by  asynchronously  prioritizing  parallel 
resources.  In  addition  we  schedule  computation  along  spanning  trees  and  ensure  a  serializable  execution 
that  is  as  close  to  the  optimal  sequential  execution  as  possible. 

Similarly,  in  Chapter  4  we  showed  that  the  fully  parallel  synchronous  Gibbs  sampler  converges  to  the 
wrong  distribution.  Again,  by  running  vertex  computation  asynchronously  and  ensuring  serializability 
we  were  able  to  obtain  a  parallel  execution  that  converges  to  the  correct  posterior  distribution.  Finally, 
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by  moving  computation  along  wavefronts  defined  by  spanning  trees  in  the  graph  we  were  able  to  further 
accelerate  convergence. 

By  restricting  our  attention  to  serializable  executions  along  dynamically  prioritized  wavefronts  we  are 
substantially  reducing  the  amount  of  available  parallelism.  Nonetheless,  for  most  of  the  problems  we 
considered  there  was  typically  enough  remaining  parallelism  to  continue  to  obtain  relatively  strong  scaling. 
However,  as  we  moved  to  large  distributed  systems  and  very  high-degree  vertices  (Chapter  6),  there  were 
cases  where  maintaining  a  serializable  execution  eliminated  most  of  the  vertex-level  parallelism.  For 
example,  while  executing  the  vertex  program  on  the  center  vertex  of  a  star  graph  no  other  vertex  program 
may  run.  In  these  cases  we  were  able  to  extract  additional  edge-level  parallelism  in  the  gather  and  scatter 
phases  of  the  GAS  decomposition.  In  addition,  the  PowerGraph  abstraction  enabled  a  substantially  more 
efficient  distributed  graph  representation  as  well  as  the  ability  to  exploit  edge-level  parallelism  in  the 
distributed  setting. 


7.3  Approximations  and  Parallelism 


In  our  work  on  belief  propagation  we  revealed  the  connection  between  approximate  convergence  and 
parallelism.  In  particular,  we  showed  that  the  limiting  sequential  component,  defined  by  the  parameter  re, 
can  be  related  to  the  accuracy  of  the  fix-point  approximation.  By  increasing  the  tolerance  at  convergence 
we  can  effectively  increase  the  number  of  processor  resources  that  can  be  used  efficiently.  However,  the 
resulting  algorithms  did  not  depend  on  the  re  parameter  and  instead  prioritized  the  use  of  parallel  resources 
allowing  the  available  resources  to  be  used  as  efficiently  as  possible.  Consequently,  we  expect  to  see  a 
diminishing  returns  as  parallel  resources  exceed  the  efficient  level  of  parallelism  in  the  model. 

In  our  work  on  Gibbs  sampling  we  showed  that  the  limiting  sequential  component  depends  on  the  chromatic 
number  of  the  underlying  Markov  Random  field.  However,  unlike  the  re  parameter  in  belief  propagation, 
the  chromatic  number  can  be  directly  estimated  from  the  Markov  Random  Field  and  the  precise  available 
parallelism  and  acceleration  in  mixing  can  be  quantified.  In  the  case  of  models  with  two  colorable  Markov 
Random  Fields,  we  can  obtain  a  fully  synchronous  approximation  which  introduces  strong  independence 
assumptions  but  also  exposes  the  maximum  available  parallelism.  Unfortunately  we  were  unable  to  extend 
this  analysis  to  Markov  Random  Fields  with  higher  chromatic  numbers.  However,  recent  state-of-the-art 
systems  [Ahmed  et  al.,  2012,  Smola  and  Narayanamurthy,  2010]  for  approximate  Gibbs  sampling  have 
demonstrated  the  advantages  of  asynchronous  computation  and  the  substantial  additional  parallelism 
achieved  through  approximations. 


7.4  Challenges  of  Dynamic  Scheduling 


Throughout  our  work  we  found  dynamic  scheduling  to  be  one  of  the  more  challenging  principles  of 
the  thesis  methodology.  To  apply  dynamic  scheduling  we  had  to  overcome  the  challenges  of  defining 
an  informative  measures  of  future  progress  as  well  as  address  the  system  challenges  associated  with 
dynamically  evolving  computation.  In  addition,  we  had  to  deal  with  the  interaction  between  the  dynamic 
heuristic  and  the  asynchronous  distributed  nature  of  its  execution. 

While  the  recursive  fixed-point  formulation  of  the  vast  majority  of  algorithms  leads  to  a  natural  measure 
of  local  convergence,  it  is  not  always  the  optimal  measure  of  future  progress.  For  example  in  our  work 
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on  belief  propagation  (Section  3.6.2)  we  found  that  an  alternative  formulation  of  local  convergence  can 
improve  performance  and  help  reduce  asymmetric  convergence.  Intuitively,  small  changes  in  one  vertex 
could  potentially  lead  to  large  changes  in  neighboring  vertices.  We  believe  that  more  work  into  the  theory 
of  adaptive  prioritization  will  lead  to  more  general  methods  to  construct  prioritization. 

Designing  and  building  the  data  structures  to  manage  priorities  presented  substantial  challenges.  While  there 
is  existing  work  in  parallel  priority  queue  design  [Rao  and  Kumar,  1988,  Shavit  and  Zemach,  1999]  their 
arc  no  widely  available  implementations  and  existing  data-structures  typically  assume  integral  or  uniform 
priorities.  In  our  work,  we  found  that  since  the  priorities  were  directly  coupled  with  the  convergence 
of  each  vertex  the  range  of  priorities  rapidly  decreased  making  it  difficult  to  apply  more  classic  binning 
techniques.  A  reasonable  compromise  which  seems  to  address  the  limitations  in  defining  priorities  as 
well  as  the  scalability  of  scheduling  data-structures  is  to  reduce  the  priority  to  the  binary  converged  or 
not  converged  states  as  was  done  in  (Section  3.5).  While  a  better  priority  definition  can  lead  to  improved 
performance,  this  simple  form  of  dynamic  scheduling  is  easy  to  define  and  implement  at  scale. 

Whether  constructing  an  edge-cut  (e.g.,  as  in  GraphLab  Section  5.7.1)  or  constructing  a  vertex-cut  (e.g., 
as  in  PowerGraph  Section  6.5)  we  are  balancing  work  by  evenly  distributing  the  graph  across  machines. 
However,  with  dynamic  scheduling  an  evenly  distributed  graph  may  lead  to  an  imbalance  in  computation. 
In  Section  3.8.1  we  proposed  over-partitioning  as  a  method  to  improve  work-balance.  However,  over¬ 
partitioning  leads  to  increased  communication  over-head  which  for  many  applications  (e.g.,  PageRank, 
Shortest  Path,  CoEM,  Graph  Coloring)  is  the  primary  bottleneck.  While  dynamic  load  balancing  may 
provide  a  solution,  it  is  difficult  to  predict  the  future  work  distribution  given  the  current  work  distribution. 
We  believe  that  form  of  distributed  work  stealing  similar  to  that  described  by  Power  and  Li  [2010]  could 
potentially  address  these  issues. 


7.5  Importance  of  High-Level  Abstractions 


High-level  abstractions  are  essential  to  isolating  the  design  and  implementation  of  sophisticated  algorithms 
from  the  design  and  implementation  of  complex  systems.  Abstractions  like  MapReduce  or  even  SQL  have 
enabled  people  with  limited  or  no  system  design  experience  to  design  and  implement  specific  computation 
patterns  that  efficiently  leverage  massive  distributed  systems  and  sophisticated  data  representations  with 
minimal  effort  and  training.  In  contrast,  in  the  absence  of  high-level  abstractions  the  high-performance 
computing  community  has  struggled  with  slower  development  cycles  and  algorithms  and  implementations 
that  are  built  around  individual  machines.  While  MPI  has  excelled  at  isolating  the  design  and  implementa¬ 
tion  of  the  communication  layer  from  the  algorithm  and  tools  like  Zoltan  have  helped  isolate  the  challenges 
of  graph  placement,  each  implementation  still  requires  substantial  system  design  and  tuning  effort.  For 
example,  in  our  work  on  graphical  model  inference  it  took  several  years  to  design,  implement,  and  tune  the 
shared  and  distributed  memory  implementations  of  our  algorithms  using  pthreads,  MPI,  and  Metis. 

However,  as  we  demonstrated,  choosing  the  wrong  high-level  abstraction  can  lead  to  substantial  inefficien¬ 
cies  both  in  the  abstraction  assumptions  as  well  as  the  requirements  it  imposes  on  the  computation.  For 
example,  implementing  GrAD  algorithms  in  the  MapReduce  abstraction  leads  to  excessive  disk-overhead 
in  iterative  computation  and  forces  the  user  to  over-express  parallelism  in  the  form  of  synchronous  compu¬ 
tation.  By  identifying  the  key  patterns  that  emerge  across  a  wide  range  of  algorithms  we  were  able  to  define 
a  new  high-level  abstraction  which  matches  the  needs  and  assumptions  of  the  GrAD  algorithms. 

Our  work  on  GraphLab  and  then  PowerGraph  demonstrated  that  we  could  both  express  a  wide  range  of 
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algorithms  using  a  targeted  abstraction  and  then  efficiently  implement  that  abstraction  in  the  multicore  and 
distributed  settings.  Furthermore,  we  observed  that  focused  effort  on  the  design  and  implementation  of 
the  abstraction  lead  to  performance  gains  in  the  many  applications  built  on  top  of  the  abstraction.  In  some 
cases  algorithms  built  on  top  of  GraphLab  and  PowerGraph  outperformed  their  less  optimized  counterparts 
built  using  the  low-level  primitives  from  high-performance  computing. 

We  believe  that  by  developing  high-level  abstractions  around  common  computational  patterns  we  can 
isolate  the  challenges  of  algorithm  and  system  design  while  enabling  the  construction  of  sophisticated 
algorithms  that  efficiently  realize  the  full  potential  of  the  available  system  resources.  As  a  consequence 
we  can  substantially  shorten  the  development  cycle  for  high-performance  systems,  enable  systems  experts 
to  simultaneously  accelerate  a  wide  range  of  sophisticated  algorithms,  and  allow  domain  experts  to  more 
readily  leverage  the  immense  computational  resources  available  in  parallel  and  distributed  computing 
systems. 
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Chapter  8 


Future  Work 


This  thesis  is  a  first  step  towards  a  larger  research  agenda  in  scalable  probabilistic  reasoning.  The  GrAD 
design  principles  and  the  resulting  algorithms  and  high-level  abstractions  form  a  foundation  on  which  to 
build  more  sophisticated  algorithms,  theoretical  tools,  and  systems.  In  this  section  we  describe  several 
potential  avenues  for  further  research  based  on  our  findings  and  experiences. 


8.1  Scalable  Online  Probabilistic  Reasoning 


In  many  real-world  applications  the  graph  structure  and  meta-data  are  rapidly  evolving.  For  example  in  the 
context  of  social  networks,  as  time  progresses  new  users  join,  make  friends,  and  change  their  profiles.  As 
new  data  arrives  and  existing  data  is  modified  we  would  like  to  be  able  to  continually  update  and  refine  our 
models  (e.g.,  best  estimates  of  users  interests).  We  believe  that  the  GrAD  methodology  will  be  essential  in 
these  situations  since  small  changes  to  the  graph  should  lead  to  localized  computation  that  can  be  prioritized 
and  executed  asynchronously. 

The  online  setting  carries  a  few  new  challenges.  First  data  is  arriving  rapidly  and  must  be  incorporated 
into  the  graph  structure.  The  graph  representations  used  in  this  work  make  assumptions  about  the  ordering 
of  vertices  and  their  placement  in  distributed  environments  that  arc  difficult  to  maintain  if  the  structure  is 
changing  rapidly.  Furthermore  the  semantics  of  graph  changes  are  not  well  defined.  For  example,  if  an 
edge  is  added  or  deleted  arc  both  vertex  programs  run?  When  are  structural  changes  made  visible  to  vertex 
programs?  Decisions  about  these  semantics  will  impact  the  types  of  algorithms  that  may  be  expressed  as 
well  as  the  guarantees  provided  by  the  system. 

The  online  setting  typically  assumes  that  computation  runs  continuously  for  extended  periods  of  time 
requiring  not  only  fault-tolerance  but  also  high-availability.  While  we  have  achieved  a  basic  form  of 
fault  tolerance  through  the  use  of  both  synchronously  and  asynchronously  constructed  snapshots,  in 
the  event  of  a  machine  failure  we  must  stop  all  execution  and  restart  at  the  last  snapshot.  In  an  online 
system,  data  is  arriving  continuously  and  must  be  incorporated  continuously.  If  a  machine  fails,  we 
must  ensure  that  any  recent  data  is  not  lost  and  that  we  can  still  incorporate  new  data  as  it  continues  to 
arrive.  This  form,  availability  is  especially  challenging  to  maintain  in  a  consistent  manner  when  there  arc 
complex  data  dependencies  (i.e.,  the  graph).  While  Pujol  et  al.  [2010]  began  to  address  this  problem,  we 
believe  that  their  are  exciting  new  opportunities  in  the  development  of  highly  available  distributed  graph 
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data-structures. 


There  has  been  only  limited  work  in  the  development  of  online  structured  probabilistic  reasoning.  Hoffman 
et  al.  [2010]  described  a  method  to  accelerate  inference  in  the  LDA  model  by  extending  the  model  and 
augmenting  a  variational  approximation  as  variables  are  added.  However,  Hoffman  et  al.  [2010]  did  not 
consider  the  case  in  which  the  variables  are  being  generated  from  an  external  source  and  the  model  is 
defined  as  the  fixed-point  of  a  recursive  relationship.  In  these  cases  we  might  like  to  continue  to  maintain 
end  extend  the  model  as  well  as  the  fixed-point  solution.  We  believe  that  by  applying  reasoning  similar  to 
the  Te-approximation  we  might  be  able  to  show  that  changes  in  the  graph  have  a  localized  effect.  Finally, 
as  the  model  structure  changes,  we  need  a  principle  method  to  introduce  new  parameters  and  factors  and 
potentially  relearn  the  parameters  already  in  the  model. 

Our  work  on  the  PowerGraph  abstraction  demonstrated  the  potential  for  exploiting  the  sparsity  pattern 
present  in  real-world  problems.  We  also  believe  that  a  similarly  exploitable  arrival  pattern  is  likely  to 
be  present  in  a  wide  range  of  real-world  problems.  For  example,  if  we  believe  our  power-law  graph  was 
derived  from  a  preferential  attachment  phenomena  then  we  might  expect  edges  to  be  added  to  high-degree 
vertices  disproportionately  often.  Alternatively,  because  people’s  online  interactions  are  often  bursty  we 
might  expect  changes  to  vertex  meta-data  to  have  a  similar  bursty  pattern.  By  tuning  caching  schemes  and 
consistency  models  for  the  arrival  patterns  in  real-world  data  we  may  be  able  to  build  more  efficient  online 
graph  processing  systems. 


8.2  Learning  a  Scalable  Parsimonious  Structures 

We  have  repeatedly  demonstrated  how  the  graphical  structure  of  data  dependencies  affects  the  convergence 
and  parallel  scaling  of  structured  probabilistic  reasoning.  In  this  work  we  have  assumed  that  the  structural 
dependencies  are  fixed  and  provided  as  input.  In  many  cases  this  is  a  reasonable  assumption  since  the 
model  structure  is  coupled  to  relational  observations  like  friendship,  spatial  proximity,  or  co-occurrence. 
However,  in  some  cases  we  may  not  know  the  structure  in  advance  or  may  be  willing  to  accept  a  simpler 
structure  both  in  pursuit  of  model  parsimony  as  well  as  increased  scalability.  This  observation  leads  to  the 
natural  questions:  can  we  regularize  for  scalability  and  thereby  efficiently  learn  scalable  structures ? 

There  are  several  challenges  to  scalable  structure  learning.  First,  in  general  structure  learning  is  computa¬ 
tionally  expensive  and  typically  requires  parameter  learning  and  inference  as  a  subroutines.  Second  general 
structure  learning  algorithms,  for  obvious  reasons,  lack  the  graphical  decomposition  that  was  central  to 
the  GrAD  methodology  and  may  require  reasoning  about  arbitrary  subsets  of  the  data.  Third,  structure 
learning  can  require  large  amounts  of  data  to  infer  reasonable  structures.  If  we  must  assess  conditional 
independence  between  groups  of  variables  we  will  need  many  joint  observations  of  subsets  of  variables. 
Finally,  we  need  to  define  a  metric  which  effectively  captures  the  notion  of  a  scalable  structure. 

To  overcome  some  of  the  computational  and  data  efficiency  challenges  we  might  pursue  structure  search 
algorithms  that  incrementally  build  upon  simpler  structures.  This  would  ensure  that  we  arc  never  required 
to  do  inference  or  parameter  learning  on  a  potentially  unscalable  or  intractable  structure.  Furthermore,  by 
building  on  earlier  structures  we  may  be  able  to  reuse  previous  inference  estimates  (e.g.,  belief  propagation 
messages)  to  warm  start  the  inference  process.  Furthermore,  by  building  upon  simpler  structures  we  may 
be  able  to  leverage  some  of  the  distributed  representation  of  the  previous  structure. 

In  large-scale  real-world  structure  search  there  is  often  considerable  domain  knowledge  about  which 
variables  might  be  related  and  which  features  may  be  informative.  For  example  in  image  processing 
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we  might  believe  that  nearby  pixels  have  similar  values  but  we  may  not  know  the  exact  local  structure. 
Alternatively,  in  the  social  networking  domain  we  may  assume  that  the  model  structure  is  a  sub-graph 
of  the  overall  social  network.  In  these  cases  we  may  be  able  to  restrict  our  attention  to  a  smaller  set  of 
possible  structures  and  in  some  cases  sub-structures  enabling  a  graphical  decomposition  of  the  search 
process.  Furthermore,  by  restricting  the  search  space  we  may  be  able  to  learn  structures  using  fewer  training 
examples. 

The  remaining  and  potentially  most  interesting  challenge  is  the  definition  of  a  metric  which  captures  the 
notion  of  a  scalable  structure.  Moreover,  what  defines  a  scalable  structure?  The  thesis  work  suggests 
several  potential  criterion.  If  we  desire  a  serializable  execution,  we  would  like  to  minimize  the  chromatic 
number  of  the  underlying  graph.  Alternatively,  if  we  are  most  worried  about  communication  and  work 
balance,  we  might  like  to  identify  a  structure  which  minimizes  cost  of  the  corresponding  vertex  cut.  Finally, 
if  we  want  to  accelerate  the  convergence  of  the  underlying  inference  algorithm  than  we  might  want  to  find 
a  structure  which  has  many  short  paths  between  variables  (e.g.,  expander  graphs).  We  believe  their  are 
exciting  opportunities  in  the  study  of  regularization  for  scalability  and  its  impact  on  model  and  system 
performance. 


8.3  Generalizing  Prioritized  Computation 

One  of  the  key  principles  in  the  GrAD  methodology  is  the  dynamic  prioritization  of  computation.  However, 
defining  informative  metrics  for  prioritization  can  be  challenging.  For  example,  in  our  work  on  belief 
propagation  and  Gibbs  sampling,  we  spent  a  considerable  effort  to  define  informative  heuristics  on 
which  to  prioritize  computation.  If  we  assume  our  algorithm  is  a  contraction  mapping  we  can  use 
the  difference  between  subsequent  computation  steps  as  a  priority  metric  which  greedily  accelerates 
convergence.  However,  many  algorithms  arc  not  a  contraction  mapping  and  so  the  priority  metric  is  less 
obvious.  For  example,  in  our  work  on  CoEM  we  struggled  to  identify  an  informative  definition  of  priorities. 
In  these  cases  we  typically  resorted  to  binary  priorities  (e.g.,  converged  or  not  converged).  Nonetheless, 
we  conjecture  that  the  study  of  prioritized  could  lead  to  accelerated  convergence  for  a  wide  range  of 
algorithms. 


8.4  Dynamic  Work  Balancing 

Dynamic  computation  is  one  of  the  key  principles  of  the  GrAD  design  methodology  which  accelerates 
convergence  and  eliminates  wasted  computation  by  focusing  computational  resources  where  they  can 
make  the  most  progress.  Just  as  with  adaptive  mesh  refinement  in  high-performance  computing  scenarios, 
adaptive  prioritization  focuses  computation  on  sub-regions  of  the  original  problem.  However,  if  the  problem 
was  originally  evenly  distributed  over  the  cluster,  it  is  common  that  some  machines  may  actually  have 
disproportionately  more  work. 

In  Section  3.8.1  we  introduced  over-partitioning  as  a  randomized  strategy  to  address  the  challenges  of  work 
imbalance  due  to  dynamic  computation  in  the  distributed  setting.  While  over-partitioning  is  appealing 
in  its  simplicity  and  can  be  shown  to  improve  work  balance,  it  increases  communication  throughout  the 
execution  of  the  system.  Alternatively,  we  could  consider  occasionally  repartitioning  the  graph,  as  is 
commonly  done  with  mesh  refinement  in  high-performance  computing.  However,  we  found  in  our  work  on 
belief  propagation,  the  dynamic  computation  can  be  relatively  unpredictable.  Instead,  we  believe  that  work 
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stealing  may  be  more  effective  in  these  situations.  Work  stealing  has  the  advantage  of  moving  work  only 
when  imbalance  arrises  and  can  be  used  to  address  imbalance  in  computation  as  well  as  lagging  processor 
resources.  Power  and  Li  [2010]  developed  a  block  based  work-stealing  mechanism  which  works  well 
in  distributed  environments  to  address  both  imbalance  in  the  processing  time  of  blocks  as  well  as  slow 
processors. 


8.5  Declarative  Abstractions  for  Recurrence  Resolution 

Many  of  the  GrAD  algorithms  effectively  encode  graph  computation  by  modeling  a  recursive  relationship 
between  a  vertex  and  its  neighboring  vertices.  The  GraphLab  and  PowerGraph  abstractions  imperatively 
express  how  to  enforce  and  repair  these  recursive  relationship  and  prioritize  corrections  to  neighboring 
vertices.  In  most  cases  the  process  of  repairing  these  recurrence  relationships  is  to  directly  recompute 
the  recurrence  locally  and  trigger  any  neighbors  to  be  recomputed  if  their  recurrence  is  violated.  As  a 
consequence,  it  may  be  possible  to  express  these  algorithms  (or  even  models)  at  a  higher  more  declarative 
level  freeing  the  system  to  determine  how  and  when  to  prioritize  computation,  enabling  composition  of 
multiple  recurrence  relations,  and  simplifying  the  transition  to  online  computation. 

Instead  of  expressing  the  algorithm  to  compute  and  preserve  a  local  recurrence  relationship,  a  declarative 
formulation  of  GrAD  algorithms  would  instead  specify  the  value  of  a  vertex  attribute  as  a  function  of  its 
neighborhood.  To  assess  approximation  convergence  (e.g.,  re- approximation)  and  enable  prioritization 
we  may  also  want  to  introduce  a  measure  of  recurrence  violation  and  a  acceptable  convergence  threshold 
e.  For  example,  the  PageRank  algorithm  could  be  expressed  compactly  as  just  the  PageRank  equation,  a 
measure  of  change,  and  a  termination  condition: 

R(v)  •<-  —  +  (1  —  a)  Y  wu,v  x  R (u) 
n 

u  links  to  v 

Change(R(n)old,  R(u)new)  :=  |R(n)old  -  R(n)new| 

e  :=  0.005 

By  expressing  PageRank  as  recurrence  relationship,  a  measure  of  change,  and  a  termination  condition, 
we  free  the  system  and  execution  environment  to  reason  about  scheduling  and  dynamic  computation. 
For  example,  the  system  could  optimize  over  different  scheduling  primitives.  In  addition,  the  recurrence 
formulation  can  automatically  be  translated  into  an  online  algorithm.  For  example,  if  the  graph  structure 
were  to  change  then  the  system  could  directly  assess  if  the  recurrence  had  been  violated  and  then  choose  to 
trigger  future  computation  as  needed.  Finally,  we  could  introduce  addition  recurrence  relationships  that 
depend  on  the  values  of  other  recurrence  relationships.  For  example,  we  might  associated  the  marketing 
value  of  a  user  with  the  PageRank  of  that  user  as  well  as  the  marketing  values  of  neighboring  users. 


8.6  The  GrAD  Methodology  and  General  Graph  Computation 

The  GrAD  methodology  could  potentially  be  applied  more  broadly  to  general  graph  computation.  The 
asynchronous  prioritized  approach  to  graphical  model  inference  as  well  as  many  of  the  analysis  techniques 
could  be  naturally  applied  to  discrete  simulations  and  the  evaluations  of  large-scale  partial  differential 
equations.  In  addition  the  GraphLab  and  PowerGraph  abstractions  could  also  be  used  to  simplify  the  design 
and  implementation  of  graphical  computation. 
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Chapter  9 


Summary  of  Notation  and  Terminology 


While  we  have  attempted  to  keep  each  chapter  as  self-contained  and  use  consistent  terminology  and 
notation,  it  can  still  be  difficult  to  track  the  various  terminology  and  notation  through  this  document.  In  this 
chapter  we  provide  a  brief  summary. 


9.1  Terminology 


GrAD  Methodology:  The  The  Graphical,  Asynchronous,  Dynamic,  (GrAD),  design  methodology  is 
composed  of  the  following  three  basic  principles: 

1.  Graphical:  Explicitly  represent  statistical  and  computational  dependencies  as  a  sparse  graph.  The 
sparse  graphical  representation  enables  the  parallel  decomposition  of  algorithms  and  exposes  the 
data  locality  needed  for  efficient  distributed  computation. 

2.  Asynchronous:  Allow  computation  to  proceed  asynchronously  as  dependencies  are  satisfied  obeying 
the  dependency  structure  encoded  in  the  graph.  By  allowing  computation  to  run  asynchronously 
as  dependencies  are  satisfied  we  can  reduce  wasted  computation  and  build  more  efficient  parallel 
and  distributed  systems.  By  obeying  the  dependency  structure  we  can  obtain  provably  correct 
asynchronous  algorithms. 

3.  Dynamic:  Adaptively  identify  and  prioritize  sub-problems  that  when  solved  make  the  most  progress 
towards  the  solution.  By  first  identifying  and  solving  the  key  sub-problems  we  can  reduce  the  overall 
work  and  more  effectively  use  computational  resources. 

Asynchronous  Computation:  Computation  that  runs  as  new  data  arrives  and  resources  become  available 
without  a  single  centralized  clock  or  source  of  coordination.  Asynchronous  algorithms  typically  use  the 
most  recent  information  available  and  communicate  changes. 

Synchronous  Computation:  All  computation  runs  simultaneously  in  phases  that  are  coordinated  by  a 
single  clock.  Typically  communication  only  occurs  between  phases  and  therefore  computation  operates  on 
the  output  of  the  previous  phase. 

Factor  Graph:  A  representation  of  a  graphical  model  as  a  bipartite  graph  with  variables  on  one  side 
connected  to  the  clique  factors  on  the  other  side. 
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Markov  Random  Field:  The  Markov  random  field  of  a  graphical  model  is  the  graph  over  the  variables  in 
which  each  variable  is  connected  to  all  variables  in  its  Markov  blanket. 


Markov  Blanket:  The  Markov  blanket  of  a  variable  are  the  set  of  variables  which  when  observed  separate 
render  the  variable  independent  from  the  rest  of  the  variables  in  the  model. 


9.2  Notation 

Joint  Probability  P  (X \ . . . . ,  Xn ) :  The  join  probability  function  for  the  variables  X\  to  Xn  describes  the 
probability  of  particular  assignment  to  all  random  variables.  In  this  work  we  focus  on  discrete  random 
variables  but  many  of  the  techniques  extend  to  continuous  random  variables. 

Number  of  variables/vertices  ]nunwars:  We  use  n  to  represent  the  number  of  variables  in  the  model  (or 
vertices  in  the  graph). 

Number  of  processors  p:  We  denote  the  number  of  individual  computational  units  (really  threads)  by 
P- 

Conditional  Probability  P  ( X*  |  X,):  The  conditional  probability  functions  describes  the  probability  of 
X,  given  (denoted  by  |  ’)  a  particular  value  for  X} . 

Model  parameters  9:  We  use  9  to  denote  the  model  parameters  of  a  graphical  model.  The  model 
parameters  along  with  the  graphical  structure  define  the  joint  probability  function. 

Clique  set  T\  The  set  A  e  T  consists  of  all  maximal  cliques  A  in  the  Markov  random  field. 

Model  Parameters  9:  The  model  parameters  encode  the  strength  and  form  of  the  relationships  between 
connected  variables  in  a  graphical  models. 

Clique  factors  /a(xa  |  9):  The  clique  factors  in  a  factor  graph  encode  the  local  relationship  between  the 
variables  in  the  set  A  C  {1 .... .  n\  as  a  positive  function  which  depends  on  the  parameters  9. 

Partition  Function  Z{9):  The  partition  function  encodes  the  normalizing  constant  for  the  joint  probability 
function  and  depends  only  on  the  model  parameters  9. 

The  sample  space  Xr  of  variable  X, :  The  sample  space  of  a  variable  is  the  set  of  possible  value  the 
variable  can  take. 

Neighbors  A fp.  The  set  of  neighbors  of  vertex  i  is  denoted  by  Nt. 

Belief  Propagation  messages  and  mfA^x,  (xi)  The  belief  propagation  messages  are  itera¬ 

tively  refined  in  the  belief  propagation  update  equations.  The  messages  can  be  related  to  the  Lagrange 
multipliers  in  the  Bethe  approximation  or  in  tree  graphical  models  the  messages  encode  the  contribution  to 
the  variable  marginal  after  eliminating  all  variables  in  the  originating  subtree. 

Belief  P  (Xa  =  xa)  ~  &x a(xa):  The  posterior  marginals  (beliefs)  estimated  by  the  belief  propagation 
algorithm. 

t; -approximation:  The  rf -approximation  accounts  for  factor  dependent  serial  structure  of  approximate 
inference.  The  number  of  iterations  (or  distance)  re  is  the  minimum  number  of  iterations  (or  distance  that  a 
message  must  travel)  to  achieve  an  e  level  approximation  of  the  message  at  convergence. 

Expectation  E  [X]:  The  expected  value  of  the  random  variable  X. 
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Transition  kernel  K(X'  \  X):  The  transition  kernel  of  a  Gibbs  sampler  assigns  the  probability  of  transition 
from  state  X  to  state  X' . 

Junction  Tree  ( C ,  E)  =  J^:  The  set  of  vertices  C  6  C  is  called  a  clique  and  is  a  subset  of  the  vertices  (i.e., 
CCS)  in  the  Splash  S.  The  cliques  satisfy  the  constraint  that  for  every  factor  domain  A  e  T  there  exists 
a  clique  C  €  C  such  that  A  fl  6”  C  C.  The  edges  E  of  the  junction  tree  satisfy  the  running  intersection 
property  (RIP)  which  ensures  that  all  cliques  sharing  a  common  variable  form  a  connected  tree. 

PageRank  Reset  Probability  7:  In  the  random  surfer  model  there  is  a  small  probability,  typically  7  =  0.15 
of  jumping  to  a  random  webpage. 

Vertex  Data  Du  and  edge  data  Du^.v:  The  GraphLab  and  PowerGraph  abstractions  associate  arbitrary 
user  defined  data  with  each  vertex  and  directed  edge  in  the  data  graph. 

Accumulator  au:  The  Gather  phase  of  the  PowerGraph  abstraction  accumulates  the  result  of  the  gather 
function  in  the  accumulator  au  for  the  vertex  u.  The  accumulator  au  is  then  saved  for  that  vertex  and 
updated  using  the  scatter  function. 

Degree  D  [1;]  of  vertex  v:  The  degree  (|jV’„|)  is  denoted  by  D  [u] . 

Power-Law  constant  a:  The  skewness  of  a  power-law  degree  distribution  is  determined  by  the  power-law 
constant  which  is  typically  a  «  2. 

Harmonic  function  h,.  (a) :  The  harmonic  function  is  the  normalizer  of  the  discrete  bounded  Zipf 
distribution  (  hT  (a)  =  1  d~a ). 

Assignment  function  A(v):  The  partition  assignment  function  A{v)  maps  a  vertex  (or  edge)  to  the  set  of 
machines  on  which  it  is  located. 
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